BRIEF ON APPEAL



Further to the Notice of Appeal filed November 27, 2001 and received by the USPTO on 
January 9, 2002, herewith are three copies of Appellants' Brief on Appeal. Appellants hereby request 
a two-month extension of time in order to file this Brief. Authorized fees include the statutory fee of 
$400.00 for a two-month extension of time, as well as the $320.00 fee for the filing of this Brief.



This is an appeal from the decision of the Examiner finally rejecting Claims 30-40 of the above- 
identified application. 



(1) REAL PARTY IN INTEREST 

The above-identified application is assigned of record to Incyte Pharmaceuticals, Inc. (now
Incyte Genomics, Inc.) (Reel 7984, Frame 0461), which is the real party in interest herein.

(2) RELATED APPEALS AND INTERFERENCES
Appellants, their legal representative and the assignee are not aware of any related appeals or 
interferences which will directly affect or be directly affected by or have a bearing on the Board's 
decision in the instant appeal. 


(3) STATUS OF THE CLAIMS 
Claims rejected: Claims 30-40
Claims allowed: (none)
Claims canceled: Claims 1-29
Claims withdrawn: Claims 41 and 42
Claims on Appeal: Claims 30-40 (A copy of the claims on appeal, as amended, can be
found in the attached Appendix).



(4) STATUS OF AMENDMENTS AFTER FINAL 
There were no amendments submitted after Final Rejection. 



(5) SUMMARY OF THE INVENTION 
Appellants' invention is directed, inter alia, to a polynucleotide encoding a polypeptide 
("HJAK2") having strong homology to mouse Jak2 kinase (IGIF) (GI 409584), and to naturally- 
occurring variants of such polynucleotides, such polynucleotides having a variety of utilities, in particular 
in expression profiling, and in particular for diagnosis of conditions or diseases characterized by 
expression of HJAK2, for toxicology testing, for drug discovery, and for chromosome mapping. (See 
the Specification at, e.g., page 1, line 10 through page 2, line 17, page 4, lines 1-11, page 14, line 24 
through page 16, line 3, and page 22, line 27 through page 23, line 8.) As described in the 
Specification (page 3, lines 24-36): 

Human Jak2 kinase (hjak2) was first identified as a partial nucleotide sequence in 
Incyte Clone 179527 during a computer search for nucleotide sequence alignments 
among the cDNAs of a placenta library. A modified XL-PCR procedure, specially 
designed oligonucleotides, and cDNAs of the placenta library were used to extend 
Incyte Clone 179527 to full length. The assembled nucleotide sequence (SEQ ID No
1), hjak2, encodes the polypeptide (SEQ ID No 2), HJAK2. Computer search and
alignment of the full length amino acid sequence showed that HJAK2 has 92% similarity
to murine Jak2 kinase (MUSPTK1; GenBank GI 409584; Wilks AF (1989) Proc Nat
Acad Sci 86:1603-7) which in turn has 96% sequence similarity with human Jak1
kinase. These homologies and the conserved residues, G48, K73, E192, and D220 which
all lie within the catalytic domain contributed to the naming and uses of hjak2.

(6) THE FINAL REJECTIONS 
Claims 36 and 37-40 stand rejected under 35 U.S.C. § 112, first paragraph, based on the 
allegation that the claimed polynucleotide variants lack an enabling disclosure. The rejection alleges in 
particular that "it is unclear what function a polypeptide variant with a mutation that results in 'altered 
activity' has and absent a teaching of this 'specific altered activity', how the polynucleotides which 
encode these polypeptides are enabled with respect to there [sic] use." (Final Office Action, page 4.) 

Claims 36 and 37-40 stand rejected under 35 U.S.C. § 112, first paragraph, based on the
allegation that the claimed polynucleotide variants lack an adequate written description. The rejection
alleges in particular that "the claimed genus of polypeptides [sic: polynucleotides] of Claim 36, part b)
are not fully described by the specification as each of this claimed genus includes polynucleotides of a
wide diversity of functions" and that the genus of "naturally-occurring polynucleotide sequences having
92% sequence identity to the sequence of SEQ ID NO:1" is "at least so broad as to encompass all
allelic variants of the polypeptide [sic: polynucleotide] of SEQ ID NO:1 (and might include all allelic
variants of other genes if there are multiple highly homologous loci)." (Final Office Action, page 5.)

Claims 36-40 stand rejected under 35 U.S.C. § 112, first paragraph, based on the allegation
that the claimed polynucleotide variants lack an adequate written description, under the allegation that
"[t]he recitation in newly added claim 36 (37-40 dependent from) of '92% identity' is rejected as being
new matter that is not supported by the original specification." (Final Office Action, page 6.)

Claims 37-40 stand rejected under 35 U.S.C. § 103(a), based on the allegation that the
claimed invention is obvious over Silvennoinen et al., stating that "[o]ne of ordinary skill in the art would
have been motivated to use the sequence taught by Silvennoinen et al. to design oligomers for use as
primers to amplify and determine the level of mRNA encoding the murine Jak2 protein or to isolate
other mRNAs encoding related proteins such as human Jak2 using hybridization or polymerase chain
reaction methodology." (Final Office Action, page 7.)

Claims 30-36 stand rejected under the judicially created doctrine of double patenting over 
claims 1-3 of U.S. Patent No. 5,914,393 (Final Office Action, page 9). 

Claims 37-40 stand rejected under the judicially created doctrine of obviousness-type double 
patenting over claims 1-3 of U.S. Patent No. 5,914,393 (Final Office Action, page 10). 

(7) ISSUES 

1. Whether one of ordinary skill in the art would know how to use the claimed
polynucleotides of Claims 36-40 in, e.g., toxicology testing, drug development, and the diagnosis of
disease, so as to satisfy the enablement requirement of 35 U.S.C. §112, first paragraph.

2. Whether the polynucleotides of Claims 36-40 meet the written description requirement 
of 35 U.S.C. §112, first paragraph, with respect to the description of the claimed polynucleotide 
variants. 

3. Whether the polynucleotides of Claims 36-40 meet the written description requirement 
of 35 U.S.C. §112, first paragraph, with respect to whether the recitation of "92% identity" is new 
matter. 

4. Whether the methods of Claims 37-40 are obvious over the Silvennoinen et al.
document.

5. Whether the polynucleotides of Claims 30-36 are covered by the judicially created
doctrine of double patenting over claims 1-3 of U.S. Patent No. 5,914,393.

6. Whether the methods of Claims 37-40 are covered by the judicially created doctrine of
obviousness-type double patenting over claims 1-3 of U.S. Patent No. 5,914,393.

(8) GROUPING OF THE CLAIMS 

As to Issue 1 

Claims 36-40 are grouped together. 
As to Issue 2 

Claims 36-40 are grouped together. 
As to Issue 3 

Claims 36-40 are grouped together. 
As to Issue 4 

Claims 37-40 are grouped together. 
As to Issue 5 

Claims 30-36 are grouped together. 
As to Issue 6 

Claims 37-40 are grouped together. 

(9) APPELLANTS' ARGUMENTS 
Issue One: Enablement Rejection 

The rejection of Claims 36-40 is improper, as the specification provides an enabling disclosure 
for the claimed subject matter. The enablement requirement of 35 U.S.C. § 112, first paragraph,
provides that an applicant must describe how to make and use what is claimed. Here, the Examiner
does not allege that one of skill in the art could not make the subject matter encompassed by the claims
(see the last line on page 3 of the Office Action of August 28, 2001). Rather, the Examiner contends
that the Specification does not describe how to use the subject matter defined by the claims. The 
Examiner's arguments in particular allege that the Specification does not provide an enabling disclosure 
because "[i]t is unclear what function a polypeptide variant [encoded by the claimed polynucleotide 
variant] with a mutation that results in 'altered activity' has and absent a teaching of this 'specific altered 
activity', how the polynucleotides which encode these polypeptides are enabled with respect to there 
[sic] use." (Final Office Action, page 4.) Such, however, is not the case. 

The invention at issue is a polynucleotide sequence corresponding to a gene that is expressed in 
human placenta tissue, as well as its naturally-occurring polynucleotide variants. The novel SEQ ID 
NO:1 polynucleotide codes for a polypeptide demonstrated in the patent specification to be a member
of the class of Jak kinases, whose biological functions include phosphorylating proteins on tyrosine
residues. (Specification, pages 1-2.) As such, the claimed invention has numerous practical, beneficial 
uses in toxicology testing, drug development, and the diagnosis of disease, none of which requires 
knowledge of how the polypeptides coded for by the polynucleotides actually function. 

Appellants submit with this brief the Declaration of Dr. Tod Bedilion describing some of the 
practical uses of the claimed invention in gene and protein expression monitoring applications. The 
Bedilion Declaration demonstrates that the positions and arguments made by the Patent Examiner with 
respect to the utility of the claimed polynucleotides are without merit. 

The Bedilion Declaration describes, in particular, how the claimed expressed polynucleotides
can be used in gene expression monitoring applications that were well-known at the time the patent
application was filed, and how those applications are useful in developing drugs and monitoring their
activity. Dr. Bedilion states that the claimed invention is a useful tool when employed as a highly
specific probe in a cDNA microarray:

Persons skilled in the art would appreciate that cDNA microarrays that contained the SEQ ID
NO:1 polynucleotide or its naturally-occurring variants would be a more useful tool than cDNA
microarrays that did not contain the polynucleotides in connection with conducting gene
expression monitoring studies on proposed (or actual) drugs for treating inflammation and
oncogenesis for such purposes as evaluating their efficacy and toxicity. (Bedilion Declaration,
¶ 11)

The Patent Examiner contends that the claimed polynucleotides cannot be useful without
precise knowledge of their biological function. But the law never has required knowledge of biological 
function to prove utility. It is the claimed invention's uses, not its functions, that are the subject of a 
proper analysis under the enablement requirement. 

In any event, as demonstrated by the Bedilion Declaration, the person of ordinary skill in the art 
can achieve beneficial results from the claimed polynucleotides in the absence of any knowledge as to 
the precise function of the proteins encoded by them. The uses of the claimed polynucleotides in gene 
expression monitoring applications are in fact independent of their precise function. 

I. The Applicable Legal Standard 

To meet the utility requirement of sections 101 and 112 of the Patent Act, the patent applicant
need only show that the claimed invention is "practically useful," Anderson v. Natta, 480 F.2d 1392,
1397, 178 USPQ 458 (CCPA 1973), and confers a "specific benefit" on the public. Brenner v.
Manson, 383 U.S. 519, 534-35, 148 USPQ 689 (1966). As discussed in a recent Court of Appeals
for the Federal Circuit case, this threshold is not high:

An invention is "useful" under section 101 if it is capable of providing some identifiable benefit. 
See Brenner v. Manson, 383 U.S. 519, 534 [148 USPQ 689] (1966); Brooktree Corp. v. 
Advanced Micro Devices, Inc., 977 F.2d 1555, 1571 [24 USPQ2d 1401] (Fed. Cir. 1992) 
("to violate Section 101 the claimed device must be totally incapable of achieving a useful 
result"); Fuller v. Berger, 120 F. 274, 275 (7th Cir. 1903) (test for utility is whether invention 
"is incapable of serving any beneficial end"). 

Juicy Whip Inc. v. Orange Bang Inc., 51 USPQ2d 1700 (Fed. Cir. 1999). 

While an asserted utility must be described with specificity, the patent applicant need not
demonstrate utility to a certainty. In Stiftung v. Renishaw PLC, 945 F.2d 1173, 1180, 20 USPQ2d
1094 (Fed. Cir. 1991), the United States Court of Appeals for the Federal Circuit explained:

An invention need not be the best or only way to accomplish a certain result, and it need only
be useful to some extent and in certain applications: "[T]he fact that an invention has only limited
utility and is only operable in certain applications is not grounds for finding lack of utility."
Envirotech Corp. v. Al George, Inc., 730 F.2d 753, 762, 221 USPQ 473, 480 (Fed. Cir.
1984).

The specificity requirement is not, therefore, an onerous one. If the asserted utility is described
so that a person of ordinary skill in the art would understand how to use the claimed invention, it is
sufficiently specific. See Standard Oil Co. v. Montedison, S.p.a., 212 U.S.P.Q. 327, 343 (3d Cir.
1981). The specificity requirement is met unless the asserted utility amounts to a "nebulous expression" 
such as "biological activity" or "biological properties" that does not convey meaningful information 
about the utility of what is being claimed. Cross v. Iizuka, 753 F.2d 1040, 1048 (Fed. Cir. 1985). 

In addition to conferring a specific benefit on the public, the benefit must also be "substantial." 
Brenner, 383 U.S. at 534. A "substantial" utility is a practical, "real-world" utility. Nelson v. Bowler, 
626 F.2d 853, 856, 206 USPQ 881 (CCPA 1980). 

If persons of ordinary skill in the art would understand that there is a "well-established" utility 
for the claimed invention, the threshold is met automatically and the applicant need not make any 
showing to demonstrate utility. Manual of Patent Examining Procedure at § 706.03(a). Only if there
is no "well-established" utility for the claimed invention must the applicant demonstrate the practical 
benefits of the invention. Id. 

Once the patent applicant identifies a specific utility, the claimed invention is presumed to 
possess it. In re Cortright, 165 F.3d 1353, 1357, 49 USPQ2d 1464 (Fed. Cir. 1999); In re Brana, 
51 F.3d 1560, 1566, 34 USPQ2d 1436 (Fed. Cir. 1995). In that case, the Patent Office bears the
burden of demonstrating that a person of ordinary skill in the art would reasonably doubt that the
asserted utility could be achieved by the claimed invention. Id. To do so, the Patent Office must
provide evidence or sound scientific reasoning. See In re Langer, 503 F.2d 1380, 1391-92, 183
USPQ 288 (CCPA 1974). If and only if the Patent Office makes such a showing, the burden shifts to 
the applicant to provide rebuttal evidence that would convince the person of ordinary skill that there is 
sufficient proof of utility. Brana, 51 F.3d at 1566. The applicant need only prove a "substantial 
likelihood" of utility; certainty is not required. Brenner, 383 U.S. at 532. 

II. Uses of the claimed polynucleotides for diagnosis of conditions and disorders
characterized by expression of HJAK2, for toxicology testing, and for drug discovery
are sufficient utilities under 35 U.S.C. §§ 101 and 112, first paragraph

The claimed invention meets all of the necessary requirements for establishing a credible utility
under the Patent Law: There are "well-established" uses for the claimed invention known to persons of 
ordinary skill in the art, and there are specific practical and beneficial uses for the invention disclosed in 
the patent application's specification. These uses are explained, in detail, in the Bedilion Declaration 
accompanying this brief. Objective evidence, not considered by the Patent Office, further corroborates 
the credibility of the asserted utilities. 

A. The use of the claimed polynucleotides for toxicology testing, drug discovery, 
and disease diagnosis are practical uses that confer "specific benefits" to the 
public 

The claimed invention has specific, substantial, real-world utility by virtue of its use in toxicology
testing, drug development and disease diagnosis through gene expression profiling. These uses are 
explained in detail in the accompanying Bedilion Declaration. There is no dispute that the claimed 
invention is in fact a useful tool in cDNA microarrays used to perform gene expression analysis. That is 
sufficient to establish utility for the claimed polynucleotides. 

In his Declaration, Dr. Bedilion explains the many reasons why a person skilled in the art 
reading the Coleman '508 application (to which the present application claims priority) on December 5, 
1995 would have understood that application to disclose the claimed polynucleotides to be useful for a 
number of gene expression monitoring applications, e.g., as a highly specific probe for the expression of 
that specific polynucleotide in connection with the development of drugs and the monitoring of the 
activity of such drugs. (Bedilion Declaration at, e.g., ¶¶ 10-12). Much, but not all, of Dr. Bedilion's
explanation concerns the use of the claimed polynucleotides in cDNA microarrays of the type first
developed at Stanford University for evaluating the efficacy and toxicity of drugs, as well as for other
applications. (Bedilion Declaration, ¶¶ 10-11).*



*Dr. Bedilion also explained, for example, why persons skilled in the art would also appreciate, 
based on the Coleman '508 specification, that the claimed polynucleotide would be useful in connection 
with developing new drugs using technology, such as northern analysis, that predated by many years the 
development of the cDNA technology (Bedilion Declaration, ¶ 12).

In connection with his explanations, Dr. Bedilion states that the "Coleman '508 application
would have led a person skilled in the art on December 5, 1995 who was using gene expression
monitoring in connection with working on developing new drugs for the treatment of oncogenesis and
cancer to conclude that a cDNA microarray that contained the SEQ ID NO:1 polynucleotide or its
naturally-occurring variants would be a highly useful tool and to request specifically that any cDNA
microarray that was being used for such purposes contain the SEQ ID NO:1 polynucleotide or its
naturally-occurring variants." (Bedilion Declaration, ¶ 11). For example, as explained by Dr. Bedilion,
"[p]ersons skilled in the art would appreciate that cDNA microarrays that contained the SEQ ID NO:1
polynucleotide or its naturally-occurring variants would be a more useful tool than cDNA microarrays 
that did not contain the polynucleotides in connection with conducting gene expression monitoring 
studies on proposed (or actual) drugs for treating oncogenesis and cancer for such purposes as 
evaluating their efficacy and toxicity." Id. 

In support of those statements, Dr. Bedilion provided detailed explanations of how cDNA 
technology can be used to conduct gene expression monitoring evaluations, with citation to the Schena 
article (published October 20, 1995) showing the state of the art on December 5, 1995. (Bedilion 
Declaration, ¶¶ 10-11). While Dr. Bedilion's explanations in paragraph 11 of his Declaration include
almost four pages of text and six subparts (a)-(f), he specifically states that his explanations are not "all- 
inclusive." Id. For example, with respect to toxicity evaluations, Dr. Bedilion had earlier explained 
how persons skilled in the art who were working on drug development on December 5, 1995 (and for 
several years prior to December 5, 1995) "without any doubt" appreciated that the toxicity (or lack of 
toxicity) of any proposed drug was "one of the most important criteria to be considered and evaluated 
in connection with the development of the drug" and how the teachings of the Coleman '508 application 
clearly include using differential gene expression analyses in toxicity studies (Bedilion Declaration, ¶ 10).

Thus, the Bedilion Declaration establishes that persons skilled in the art reading the Coleman 
'508 application at the time it was filed "would have wanted their cDNA microarray to have a probe as 
described in (i) because a microarray that contained such a probe (as compared to one that did not) 
would provide more useful results in the kind of gene expression monitoring studies using cDNA 
microarrays that persons skilled in the art have been doing since well prior to December 5, 1995." 
(Bedilion Declaration, ¶ 11, item (f)). This, by itself, provides more than sufficient reason to compel the
conclusion that the Coleman '508 application disclosed to persons skilled in the art at the time of its 
filing substantial, specific and credible real-world utilities for the claimed polynucleotides. 

Nowhere does the Patent Examiner address the fact that, as described on page 14, lines 24-36 
of the Coleman '508 application, the claimed polynucleotides can be used as highly specific probes on, 
for example, chips (such as cDNA microarrays) - probes that without question can be used to measure 
both the existence and amount of complementary RNA sequences known to be the expression 
products of the claimed polynucleotides. The claimed invention is not, in that regard, some random 
sequence whose value as a probe is speculative or would require further research to determine. 

Given the fact that the claimed polynucleotides are known to be expressed, their utility as 
measuring and analyzing instruments for expression levels is as indisputable as a scale's utility for 
measuring weight. This use as a measuring tool, regardless of how the expression level data ultimately 
would be used by a person of ordinary skill in the art, by itself demonstrates that the claimed invention 
provides an identifiable, real-world benefit that meets the utility requirement. Raytheon v. Roper, 724
F.2d 951 (Fed. Cir. 1983) (claimed invention need only meet one of its stated objectives to be useful);
In re Cortright, 165 F.3d 1353, 1359 (Fed. Cir. 1999) (how the invention works is irrelevant to
utility); MPEP § 2107 ("Many research tools such as gas chromatographs, screening assays, and
nucleotide sequencing techniques have a clear, specific, and unquestionable utility (e.g., they are useful
in analyzing compounds)" (emphasis added)).

The Bedilion Declaration shows that the Schena article confirms and further establishes the 
utility of cDNA microarrays in drug development gene expression monitoring applications at the time 
the Coleman '508 application was filed (Bedilion Declaration ¶ 10; Bedilion Exhibit A).

B. The use of nucleic acids coding for proteins expressed by humans as tools for
toxicology testing, drug discovery, and the diagnosis of disease is now
"well-established"

The technologies made possible by expression profiling and the DNA tools upon which they 
rely are now well-established. The technical literature recognizes not only the prevalence of these 
technologies, but also their unprecedented advantages in drug development, testing and safety
assessment. These technologies include toxicology testing, as described by Bedilion in his Declaration.
Toxicology testing is now standard practice in the pharmaceutical industry. See, e.g., John C.
Rockett, et al., Differential gene expression in drug metabolism and toxicology: practicalities,
problems, and potential, Xenobiotica 29:655-691 (July 1999) (Reference No. 1):

Knowledge of toxin-dependent regulation in target tissues is not solely an academic pursuit as 
much interest has been generated in the pharmaceutical industry to harness this technology in 
the early identification of toxic drug candidates, thereby shortening the developmental process 
and contributing substantially to the safety assessment of new drugs. (Reference No. 1, page
656.)

To the same effect are several other scientific publications, including Emile F. Nuwaysir, et al.,
Microarrays and Toxicology: The Advent of Toxicogenomics, Molecular Carcinogenesis 24:153-159
(1999) (Reference No. 2); Sandra Steiner and N. Leigh Anderson, Expression profiling in toxicology
- potentials and limitations, Toxicology Letters 112-13:467-471 (2000) (Reference No. 3).

Nucleic acids useful for measuring the expression of whole classes of genes are routinely
incorporated for use in toxicology testing. Nuwaysir et al. describes, for example, a Human ToxChip
comprising 2089 human clones, which were selected

for their well-documented involvement in basic cellular processes as well as their responses to 
different types of toxic insult. Included on this list are DNA replication and repair genes, 
apoptosis genes, and genes responsive to PAHs and dioxin-like compounds, peroxisome 
proliferators, estrogenic compounds, and oxidant stress. Some of the other categories of genes 
include transcription factors, oncogenes, tumor suppressor genes, cyclins, kinases, 
phosphatases, cell adhesion and motility genes, and homeobox genes. Also included in this 
group are 84 housekeeping genes, whose hybridization intensity is averaged and used for signal 
normalization of the other genes on the chip. (Emphasis added.) 

See also Table 1 of Nuwaysir et al. (listing additional classes of genes deemed to be of special interest
in making a human toxicology microarray).

The more genes that are available for use in toxicology testing, the more powerful the technique.
"Arrays are at their most powerful when they contain the entire genome of the species they are being
used to study." John C. Rockett and David J. Dix, Application of DNA Arrays to Toxicology,
Environ. Health Perspec. 107:681-685 (1999) (Reference No. 4, see page 683). Control genes are
carefully selected for their stability across a large set of array experiments in order to best study the
effect of toxicological compounds. See attached email from the primary investigator on the Nuwaysir
paper, Dr. Cynthia Afshari, to an Incyte employee, dated July 3, 2000, as well as the original message 
to which she was responding (Reference No. 5), indicating that even the expression of carefully 
selected control genes can be altered. Thus, there is no expressed gene which is irrelevant to screening 
for toxicological effects, and all expressed genes have a utility for toxicological screening. 

In fact, the potential benefits to the public, in terms of lives saved and reduced health care costs,
are enormous. Recent developments provide evidence that the benefits of this information are already 
beginning to manifest themselves. Examples include the following: 

• In 1999, CV Therapeutics, an Incyte collaborator, was able to use Incyte gene 
expression technology, information about the structure of a known transporter gene, 
and chromosomal mapping location, to identify the key gene associated with Tangiers 
disease. This discovery took place over a matter of only a few weeks, due to the 
power of these new genomics technologies. The discovery received an award from the 
American Heart Association as one of the top 10 discoveries associated with heart 
disease research in 1999. 

• In an April 9, 2000, article published by the Bloomberg news service, an Incyte 
customer stated that it had reduced the time associated with target discovery and 
validation from 36 months to 18 months, through use of Incyte's genomic information
database. Other Incyte customers have privately reported similar experiences. The 
implications of this significant saving of time and expense for the number of drugs that 
may be developed and their cost are obvious. 

• In a February 10, 2000, article in the Wall Street Journal, one Incyte customer stated 
that over 50 percent of the drug targets in its current pipeline were derived from the 
Incyte database. Other Incyte customers have privately reported similar experiences. 
By doubling the number of targets available to pharmaceutical researchers, Incyte 
genomic information has demonstrably accelerated the development of new drugs. 

Because the Patent Examiner failed to address or consider the "well-established" utilities for the
claimed invention in toxicology testing, drug development, and the diagnosis of disease, the Examiner's
rejections should be overturned regardless of their merit.

C. Objective evidence corroborates the utilities of the claimed invention

There is, in fact, no restriction on the kinds of evidence a Patent Examiner may consider in 
determining whether a "real-world" utility exists. Indeed, "real-world" evidence, such as evidence 
showing actual use or commercial success of the invention, can demonstrate conclusive proof of utility. 
Raytheon v. Roper, 220 USPQ 592 (Fed. Cir. 1983); Nestle v. Eugene, 55 F.2d 854, 856, 12
USPQ 335 (6th Cir. 1932). Indeed, proof that the invention is made, used or sold by any person or
entity other than the patentee is conclusive proof of utility. United States Steel Corp. v. Phillips
Petroleum Co., 865 F.2d 1247, 1252, 9 USPQ2d 1461 (Fed. Cir. 1989).

Over the past several years, a vibrant market has developed for databases containing all 
expressed genes (along with the polypeptide translations of those genes), in particular genes having 
medical and pharmaceutical significance such as the instant sequence. (Note that the value in these 
databases is enhanced by their completeness, but each sequence in them is independently valuable.) 
The databases sold by Appellants' assignee, Incyte, include exactly the kinds of information made 
possible by the claimed invention, such as tissue and disease associations. Incyte sells its database 
containing the claimed sequence and millions of other sequences throughout the scientific community, 
including to pharmaceutical companies who use the information to develop new pharmaceuticals. 

Both Incyte's customers and the scientific community have acknowledged that Incyte's 
databases have proven to be valuable in, for example, the identification and development of drug 
candidates. As Incyte adds information to its databases, including the information that can be generated 
only as a result of Incyte's discovery of the claimed polynucleotides and its use of those polynucleotides 
on cDNA microarrays, the databases become even more powerful tools. Thus the claimed invention 
adds more than incremental benefit to the drug discovery and development process. 

III. The Patent Examiner's Rejections Are Without Merit

A. The Precise Biological Role Or Function Of An Expressed Polynucleotide Is
Not Required To Demonstrate Utility

The Patent Examiner's primary rejection of the claimed invention is based on the ground that,
without information as to the precise "biological role" of the claimed invention, the claimed invention's 
utility is not sufficiently specific. 

It may be that detailed information on biological function is necessary to satisfy the
requirements for publication in some technical journals, but it is not necessary to satisfy the
requirements for obtaining a United States patent. The relevant question is not, as the Examiner would
have it, whether it is known how or why the invention works, In re Cortright, 165 F.3d 1353, 1359
(Fed. Cir. 1999), but rather whether the invention provides an "identifiable benefit" in presently
available form. Juicy Whip Inc. v. Orange Bang Inc., 185 F.3d 1364, 1366 (Fed. Cir. 1999). If the
benefit exists, and there is a substantial likelihood the invention provides the benefit, it is useful. There
can be no doubt, particularly in view of the Bedilion Declaration (at, e.g., ¶¶ 10-11), that the
present invention meets this test. 

The threshold for determining whether an invention produces an identifiable benefit is low. 
Juicy Whip, 185 F.3d at 1366. Only those utilities that are so nebulous that a person of ordinary skill 
in the art would not know how to achieve an identifiable benefit and, at least according to the PTO 
guidelines, so-called "throwaway" utilities that are not directed to a person of ordinary skill in the art at 
all, do not meet the statutory requirement of utility. Utility Examination Guidelines, 66 Fed. Reg. 1092 
(Jan. 5, 2001). 

Knowledge of the biological function or role of a biological molecule has never been required to 
show real-world benefit. In its most recent explanation of its own utility guidelines, the PTO 
acknowledged as much (66 F.R. at 1095): 

[T]he utility of a claimed DNA does not necessarily depend on the function of the 
encoded gene product. A claimed DNA may have specific and substantial utility 
because, e.g., it hybridizes near a disease-associated gene or it has gene-regulating 
activity. 

By implicitly requiring knowledge of biological function for any claimed nucleic acid, the 
Examiner has, contrary to law, elevated what is at most an evidentiary factor into an absolute 
requirement of utility. Rather than looking to the biological role or function of the claimed invention, the 
Examiner should have looked first to the benefits it is alleged to provide. 

B. Membership in a Class of Useful Products Can Be Proof of Utility 

Despite the uncontradicted evidence that the claimed variant polynucleotides are expressed 
polynucleotides, the Examiner refused to impute the utility of the members of the family of expressed 
polypeptides to the claimed variant polynucleotides. In the Final Office Action, the Patent Examiner 
takes the position that, unless Appellants can identify which particular biological function within the class 
of expressed polynucleotides is possessed by the claimed polynucleotides, enablement cannot be 
imputed. 

In order to demonstrate utility by membership in a class, the law requires only that the class not 
contain a substantial number of useless members. So long as the class does not contain a substantial 
number of useless members, there is sufficient likelihood that the claimed invention will have utility, and 
a rejection under 35 U.S.C. § 101 (and hence a rejection under 35 U.S.C. § 112, first paragraph, 
based on lack of an enabled use) is improper. That is true regardless of how the claimed invention 
ultimately is used and whether or not the members of the class possess one utility or many. See 
Brenner v. Manson, 383 U.S. 519, 532 (1966); Application of Kirk, 376 F.2d 936, 943 (CCPA 
1967). 

Membership in a "general" class is insufficient to demonstrate utility only if the class contains a 
sufficient number of useless members such that a person of ordinary skill in the art could not impute 
utility by a substantial likelihood. There would be, in that case, a substantial likelihood that the claimed 
invention is one of the useless members of the class. In the few cases in which class membership did 
not prove utility by substantial likelihood, the classes did in fact include predominantly useless members. 
E.g., Brenner (man-made steroids); Kirk (same); Natta (man-made polyethylene polymers). 

The Examiner addresses the claimed variant polynucleotides as if the general class in which they are 
included is not the family of expressed polynucleotides, but rather all polynucleotides, including the vast 
majority of useless theoretical molecules not occurring in nature, and thus not pre-selected by nature to 
be useful. While these "general classes" may contain a substantial number of useless members, the 
family of expressed polynucleotides does not. The family of expressed polynucleotides is sufficiently 
specific to rule out any reasonable possibility that the claimed variant polynucleotides would not also be 
useful like the other members of the family. 

Because the Examiner has not presented any evidence that the family of expressed 
polynucleotides has any, let alone a substantial number of, useless members, the Examiner must 
conclude that there is a "substantial likelihood" that the claimed variant polynucleotides are useful. 

C. Because the uses of the claimed polynucleotides in toxicology testing, drug 
discovery, and disease diagnosis are practical uses beyond mere study of the 
invention itself, the claimed invention has substantial utility 

As used in toxicology testing, drug discovery, and disease diagnosis, the claimed invention has a 
beneficial use in research other than studying the claimed invention or its protein products. It is a tool, 
rather than an object, of research. The data generated in gene expression monitoring using the claimed 
invention as a tool is not used merely to study the claimed polynucleotides themselves, but rather to 
study properties of tissues, cells, and potential drug candidates and toxins. Without the claimed 
invention, the information regarding the properties of tissues, cells, drug candidates and toxins is less 
complete. (Bedilion Declaration at ¶ 11.) 

The claimed invention has numerous additional uses as a research tool, each of which alone is a 
"substantial utility." These include uses in chromosomal mapping (Specification, page 15, line 14, 
through page 16, line 3). 

IV. By Requiring the Patent Applicant to Assert a Particular or Unique Utility, the Patent 
Examination Utility Guidelines and Training Materials Applied by the Patent 
Examiner Misstate the Law 

There is an additional, independent reason to overturn the rejections: to the extent the rejections 
are based on the Revised Interim Utility Examination Guidelines (64 FR 71427, December 21, 1999), the 
final Utility Examination Guidelines (66 FR 1092, January 5, 2001) and/or the Revised Interim Utility 
Guidelines Training Materials (USPTO Website www.uspto.gov, March 1, 2000), the Guidelines and 
Training Materials are themselves inconsistent with the law. 

The Training Materials, which direct the Examiners regarding how to apply the Utility 
Guidelines, address the issue of specificity with reference to two kinds of asserted utilities: "specific" 
utilities which meet the statutory requirements, and "general" utilities which do not. The Training 
Materials define a "specific utility" as follows: 

A [specific utility] is specific to the subject matter claimed. This contrasts to general 
utility that would be applicable to the broad class of invention. For example, a claim to 
a polynucleotide whose use is disclosed simply as "gene probe" or "chromosome 
marker" would not be considered to be specific in the absence of a disclosure of a 
specific DNA target. Similarly, a general statement of diagnostic utility, such as 
diagnosing an unspecified disease, would ordinarily be insufficient absent a disclosure of 
what condition can be diagnosed. 

The Training Materials distinguish between "specific" and "general" utilities by assessing 
whether the asserted utility is sufficiently "particular," i.e., unique (Training Materials at p. 52) as 
compared to the "broad class of invention." (In this regard, the Training Materials appear to parallel 
the view set forth in Stephen G. Kunin, Written Description Guidelines and Utility Guidelines, 82 
J.P.T.O.S. 77, 97 (Feb. 2000) ("With regard to the issue of specific utility the question to ask is 
whether or not a utility set forth in the specification is particular to the claimed invention.")). 

Such "unique" or "particular" utilities never have been required by the law. To meet the utility 
requirement, the invention need only be "practically useful," Natta, 480 F.2d at 1397, and confer a 
"specific benefit" on the public. Brenner, 383 U.S. at 534. Thus, incredible "throwaway" utilities, such 
as trying to "patent a transgenic mouse by saying it makes great snake food," do not meet this standard. 
Karen Hall, Genomic Warfare, The American Lawyer 68 (June 2000) (quoting John Doll, Chief of the 
Biotech Section of USPTO). 

This does not preclude, however, a general utility, contrary to the statement in the Training 
Materials where "specific utility" is defined (page 5). Practical real-world uses are not limited to uses 
that are unique to an invention. The law requires that the practical utility be "definite," not particular. 
Montedison, 664 F.2d at 375. Appellants are not aware of any court that has rejected an assertion of 
utility on the grounds that it is not "particular" or "unique" to the specific invention. Where courts have 
found utility to be too "general," it has been in those cases in which the asserted utility in the patent 
disclosure was not a practical use that conferred a specific benefit. That is, a person of ordinary skill in 
the art would have been left to guess as to how to benefit at all from the invention. In Kirk, for 
example, the CCPA held the assertion that a man-made steroid had "useful biological activity" was 
insufficient where there was no information in the specification as to how that biological activity could be 
practically used. Kirk, 376 F.2d at 941. 

The fact that an invention can have a particular use does not provide a basis for requiring a 
particular use. See Brana, supra (disclosure describing a claimed antitumor compound as being 
homologous to an antitumor compound having activity against a "particular" type of cancer was 
determined to satisfy the specificity requirement). "Particularity" is not and never has been the sine qua 
non of utility; it is, at most, one of many factors to be considered. 

As described supra, broad classes of inventions can satisfy the utility requirement so long as a 
person of ordinary skill in the art would understand how to achieve a practical benefit from knowledge 
of the class. Only classes that encompass a significant portion of nonuseful members would fail to meet 
the utility requirement. Supra § III.B. (Montedison, 664 F.2d at 374-75). 

The Training Materials fail to distinguish between broad classes that convey information of 
practical utility and those that do not, lumping all of them into the latter, unpatentable category of 
"general" utilities. As a result, the Training Materials paint with too broad a brush. Rigorously applied, 
they would render unpatentable whole categories of inventions that heretofore have been considered to 
be patentable and that have indisputably benefitted the public, including the claimed invention. See 
supra § III.B. Thus the Training Materials cannot be applied consistently with the law. 

Issue Two: Written Description Rejection with respect to the description of the variants 

The Examiner rejected Claims 36-40 under 35 U.S.C. § 112, first paragraph, "as containing 
subject matter which was not described in the specification in such a way as to reasonably convey to 
one skilled in the relevant art that the inventor(s), at the time the application was filed, had possession of 
the claimed invention." (Final Office Action, page 4.) In particular, the Office Action asserts that the 
Specification does not provide adequate written description of the polynucleotide "variants" recited by 
the claims. This rejection is respectfully traversed. 

The requirements necessary to fulfill the written description requirement of 35 U.S.C. § 112, first 
paragraph, are well established by case law. 

... the applicant must also convey with reasonable clarity to those skilled in 
the art that, as of the filing date sought, he or she was in possession of the invention. 
The invention is, for purposes of the "written description" inquiry, whatever is now 
claimed. Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117 (Fed. Cir. 1991). 

Attention is also drawn to the Patent and Trademark Office's own "Guidelines for Examination 
of Patent Applications Under the 35 U.S.C. Sec. 112, para. 1", published January 5, 2001, which 
provide that: 

An applicant may also show that an invention is complete by disclosure of sufficiently 
detailed, relevant identifying characteristics which provide evidence that applicant was 
in possession of the claimed invention, i.e., complete or partial structure, other 
physical and/or chemical properties, functional characteristics when coupled with a 
known or disclosed correlation between function and structure, or some combination of 
such characteristics. What is conventional or well known to one of ordinary skill in the 
art need not be disclosed in detail. If a skilled artisan would have understood the 
inventor to be in possession of the claimed invention at the time of filing, even if every 
nuance of the claims is not explicitly described in the specification, then the adequate 
description requirement is met. 

Thus, the written description standard is fulfilled by both what is specifically disclosed and what 
is conventional or well known to one skilled in the art. 

SEQ ID NO:1 and SEQ ID NO:2 are specifically disclosed in the application (see, for 
example, the Sequence Listing, pages 35 through 40). Variants of SEQ ID NO:2 are described, for 
example, at page 7, lines 20-26. Murine Jak2 kinase having 92% sequence identity to the human Jak2 
kinase of the present invention is described at, e.g., page 3, lines 31-34 and Figure 2. Incyte clones in 
which the nucleic acids encoding human Jak2 kinase were first identified and libraries from which those 
clones were isolated are described, for example, at page 3, lines 24 through 29 of the Specification. 
Chemical and structural features of the human Jak2 kinase are described, for example, on page 3, lines 
31 through 36. Given SEQ ID NO:1 and SEQ ID NO:2, one of ordinary skill in the art would 
recognize naturally-occurring variants having greater than 92% sequence identity to SEQ ID NO:1 and 
SEQ ID NO:2, respectively. The specification describes how to use BLAST to determine whether a 
given sequence falls within the "greater than 92% sequence identity" scope (e.g., page 19, line 16 
through page 20, line 27). Accordingly, the Specification provides an adequate written description of 
the recited polynucleotide and polypeptide sequences. 
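
By way of illustration only, the threshold comparison referred to above can be sketched in a few lines of 
Python. The sketch is not the BLAST procedure described in the Specification. It assumes that the two 
sequences have already been aligned, with "-" marking gaps, and the function names percent_identity and 
exceeds_identity_threshold are hypothetical names chosen for this example. Identity is computed here as 
matching positions divided by aligned columns, which is one common convention; BLAST reports identity 
over the local alignments it finds, so its figures may differ. 

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns of two pre-aligned, equal-length sequences.

    Assumption for this sketch: "-" marks a gap, and identity is
    (matching non-gap columns) / (columns that are not gap-vs-gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Columns in which both sequences have a gap carry no information; skip them.
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("alignment has no informative columns")
    # A column matches only when both residues are identical and neither is a gap.
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def exceeds_identity_threshold(
    aligned_a: str, aligned_b: str, threshold: float = 92.0
) -> bool:
    """True only when identity is strictly greater than the threshold."""
    return percent_identity(aligned_a, aligned_b) > threshold
```

Under these assumptions, a candidate sequence that is exactly 92% identical to the reference does not 
satisfy a "greater than 92%" limitation, because the comparison is strict. 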

A. The present claims specifically define the claimed genus through the recitation 
of chemical structure 

Court cases in which "DNA claims" have been at issue commonly emphasize that the recitation 
of structural features or chemical or physical properties are important factors to consider in a written 
description analysis of such claims. For example, in Fiers v. Revel, 25 USPQ2d 1601, 1606 (Fed. 
Cir. 1993), the court stated that: 

If a conception of a DNA requires a precise definition, such as by structure, formula, 
chemical name or physical properties, as we have held, then a description also requires 
that degree of specificity. 

In a number of instances in which claims to DNA have been found invalid, the courts have 
noted that the claims attempted to define the claimed DNA in terms of functional characteristics without 
any reference to structural features. As set forth by the court in University of California v. Eli Lilly 
and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997): 

In claims to genetic material, however, a generic statement such as "vertebrate insulin 
cDNA" or "mammalian insulin cDNA," without more, is not an adequate written 
description of the genus because it does not distinguish the claimed genus from others, 
except by function. 

Thus, the mere recitation of functional characteristics of a DNA, without the definition of 
structural features, has been a common basis by which courts have found invalid claims to DNA. For 
example, in Lilly, 43 USPQ2d at 1407, the court found invalid for violation of the written description 
requirement the following claim of U.S. Patent No. 4,652,525: 

1. A recombinant plasmid replicable in procaryotic host containing within its nucleotide 
sequence a subsequence having the structure of the reverse transcript of an mRNA of a 
vertebrate, which mRNA encodes insulin. 

In Fiers, 25 USPQ2d at 1603, the parties were in an interference involving the following count: 

A DNA which consists essentially of a DNA which codes for a human fibroblast 
interferon-beta polypeptide. 

Party Revel in the Fiers case argued that its foreign priority application contained an adequate 
written description of the DNA of the count because that application mentioned a potential method for 
isolating the DNA. The Revel priority application, however, did not have a description of any particular 
DNA structure corresponding to the DNA of the count. The court therefore found that the Revel 
priority application lacked an adequate written description of the subject matter of the count. 

Thus, in Lilly and Fiers, nucleic acids were defined on the basis of functional characteristics 
and were found not to comply with the written description requirement of 35 U.S.C. §112; i.e., "an 
mRNA of a vertebrate, which mRNA encodes insulin" in Lilly, and "DNA which codes for a human 
fibroblast interferon-beta polypeptide" in Fiers. In contrast to the situation in Lilly and Fiers, the 
claims at issue in the present application define polynucleotides and polypeptides in terms of chemical 
structure, rather than functional characteristics. For example, the "variant language" of independent 
Claim 36 recites chemical structure to define the claimed genus: 

36. An isolated polynucleotide comprising a polynucleotide sequence selected 
from the group consisting of: . . . 

b) a naturally occurring polynucleotide sequence having greater than 92% 
sequence identity to the polynucleotide sequence of SEQ ID NO:1, . . . 

From the above it should be apparent that the claims of the subject application are 
fundamentally different from those found invalid in Lilly and Fiers. The subject matter of the present 
claims is defined in terms of the chemical structure of SEQ ID NO:1. In the present case, there is no 
reliance merely on a description of functional characteristics of the polynucleotides and polypeptides 
recited by the claims. In fact, there is no recitation of functional characteristics for the claimed 
polynucleotides encoding polypeptide variants. Moreover, if such functional recitations were included, 
they would add to the structural characterization of the recited polynucleotides and polypeptides. The 
polynucleotides and polypeptides defined in the claims of the present application recite structural 
features, and cases such as Lilly and Fiers stress that the recitation of structure is an important factor to 
consider in a written description analysis of claims of this type. By failing to base its written description 
inquiry "on whatever is now claimed," the Office Action failed to provide an appropriate analysis of the 
present claims and how they differ from those found not to satisfy the written description requirement in 
Lilly and Fiers. 

B. The present claims do not define a genus which is highly variant 

Furthermore, the claims at issue do not describe a genus which could be characterized as highly 
variant. Available evidence illustrates that the claimed genus is of narrow scope. 

In support of this assertion, the Examiner's attention is directed to the enclosed reference by 
Brenner et al. ("Assessing sequence comparison methods with reliable structurally identified distant 
evolutionary relationships," Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078; Reference No. 6). 
Through exhaustive analysis of a data set of proteins with known structural and functional relationships 
and with <40% overall sequence identity, Brenner et al. have determined that 30% identity is a reliable 
threshold for establishing evolutionary homology between two sequences aligned over at least 150 
residues. (Brenner et al., pages 6073 and 6076.) Furthermore, local identity is particularly important in 
this case for assessing the significance of the alignments, as Brenner et al. further report that ≥40% 
identity over at least 70 residues is reliable in signifying homology between proteins. (Brenner et al., 
page 6076.) 

The present application is directed, inter alia, to polynucleotides encoding polypeptides 
related to the amino acid sequence of SEQ ID NO:2. In accordance with Brenner et al., 
naturally occurring molecules may exist which could be characterized as Jak2 kinase proteins and which 
have as little as 40% identity over at least 70 residues to SEQ ID NO:2. The "variant language" of the 
present claims recites, for example, polynucleotides comprising "a naturally occurring polynucleotide 
sequence having greater than 92% sequence identity to the polynucleotide sequence of SEQ ID NO:1." 
This variation is far less than that of all potential Jak2 kinase proteins related to SEQ ID NO:2, i.e., 
those Jak2 kinase proteins having as little as 40% identity over at least 70 residues to SEQ ID NO:2. 

C. The state of the art at the time of the present invention is further advanced 
than at the time of the Lilly and Fiers applications 

In the Lilly case, claims of U.S. Patent No. 4,652,525 were found invalid for failing to comply 
with the written description requirement of 35 U.S.C. §112. The '525 patent claimed the benefit of 
priority of two applications, Application Serial No. 801,343 filed May 27, 1977, and Application Serial 
No. 805,023 filed June 9, 1977. In the Fiers case, party Revel claimed the benefit of priority of an 
Israeli application filed on November 21, 1979. Thus, the written description inquiry in those cases was 
based on the state of the art at essentially the "dark ages" of recombinant DNA technology. 

The present application has a priority date of December 5, 1995. Much has happened in the 
development of recombinant DNA technology in the 16 or more years between the filing of the 
applications involved in Lilly and Fiers and the filing of the present application. For example, the 
technique of polymerase chain reaction (PCR) was invented. Highly efficient cloning and DNA sequencing 
technology has been developed. Large databases of protein and nucleotide sequences have been 
compiled. Much of the raw material of the human and other genomes has been sequenced. With these 
remarkable advances one of skill in the art would recognize that, given the sequence information of 
SEQ ID NO:1 and SEQ ID NO:2, and the additional extensive detail provided by the subject 
application, the present inventors were in possession of the claimed polynucleotide variants at the time 
of filing of this application. 

D. Summary 

The Office Action failed to base its written description inquiry "on whatever is now claimed." 
Consequently, the Action did not provide an appropriate analysis of the present claims and how they 
differ from those found not to satisfy the written description requirement in cases such as Lilly and 
Fiers. In particular, the claims of the subject application are fundamentally different from those found 
invalid in Lilly and Fiers. The subject matter of the present claims is defined in terms of the chemical 
structure of SEQ ID NO:1 and SEQ ID NO:2. The courts have stressed that structural features are 
important factors to consider in a written description analysis of claims to nucleic acids and proteins. 
Furthermore, there have been remarkable advances in the state of the art since the Lilly and Fiers 
cases, and these advances were given no consideration whatsoever in the position set forth by the Final 
Office Action. 

Issue Three: Written Description Rejection with respect to new matter 

The Examiner rejected Claims 36-40 under 35 U.S.C. §112, first paragraph, as allegedly 
containing new matter with respect to the recitation of "92% sequence identity" (see part b of claim 
36). 

As discussed above in connection with Issue Two, case law provides that to fulfill the written 
description requirement of 35 U.S.C. §112, first paragraph, ". . . the applicant must also convey with 
reasonable clarity to those skilled in the art that, as of the filing date sought, he or she was in possession 
of the invention. The invention is, for purposes of the 'written description' inquiry, whatever is now 
claimed." Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117 (Fed. Cir. 1991). Consideration of 
the originally filed application shows that Appellants were in possession of what is now claimed, i.e., "a 
naturally occurring polynucleotide sequence having greater than 92% sequence identity to the 
polynucleotide sequence of SEQ ID NO:1." 

In this regard, see the following portions of the Specification: 

"Naturally occurring HJAK2" refers to a polypeptide produced by cells which have not 
been genetically engineered or which have been genetically engineered to produce the 
same sequence as that naturally produced. (Specification at page 7, lines 8-10) 

As a result of the degeneracy of the genetic code, a multitude of HJAK2-encoding 
nucleotide sequences may be produced and some of these will bear only minimal 
homology to the endogenous sequence of any known and naturally occurring Jak2 
kinase sequence. This invention has specifically contemplated each and every possible 
variation of nucleotide sequence that could be made by selecting combinations based 
on possible codon choices. These combinations are made in accordance with the 
standard triplet genetic code as applied to the nucleotide sequence of naturally 
occurring HJAK2 and all such variations are to be considered as being specifically 
disclosed. (Specification at page 10, lines 4-12) 

The assembled nucleotide sequence (SEQ ID No 1), hjak2, encodes the polypeptide 
(SEQ ID No 2), HJAK2. Computer search and alignment of the full length amino acid 
sequence showed that HJAK2 has 92% similarity to murine Jak2 kinase (MUSPTK1; 
GenBank GI 409584; Wilks AF (1989) Proc Nat Acad Sci 86:1603-7) which in turn 
has 96% sequence similarity with human Jak1 kinase. These homologies and the 
conserved residues, G^, K73, E192, and D220 which all lie within the catalytic domain 
contributed to the naming and uses of hjak2. (Specification at page 3, lines 29-36) 

Before the present sequences, variants, formulations and methods for making and using 
the invention are described, it is to be understood that the invention is not to be limited 
only to the particular sequences, variants, formulations or methods described. The 
sequences, variants, formulations and methodologies may vary, and the terminology 
used herein is for the purpose of describing particular embodiments. The terminology 
and definitions are not intended to be limiting since the scope of protection will 
ultimately depend upon the claims. (Specification at page 9, lines 9-16) 

Thus, while the originally filed application does not contain a verbatim recitation of the present 
"92% sequence identity" claim language, it is apparent that the inventors contemplated naturally 
occurring polynucleotide and polypeptide sequences of Jak2 kinase molecules. Moreover, the 
inventors were aware of the Wilks murine Jak2 kinase, which has 92% similarity to the amino acid 
sequence of SEQ ID NO:2. Hence, it is axiomatic that the present inventors considered naturally 
occurring polynucleotide sequences having greater than 92% sequence identity to the polynucleotide 
sequence of SEQ ID NO:1 as part of their invention, i.e., those naturally occurring polynucleotide 
sequences which were not part of the prior art. 

Accordingly, the "92% sequence identity" language appearing in part b of Claim 36 does not 
represent new matter. 

Issue Four: Obviousness Rejection 

The Examiner rejected Claims 37-40 under 35 U.S.C. § 103(a) as being unpatentable over 
Silvennoinen et al. The Examiner alleged that "an oligomer of the polynucleotide comprising the nucleic 
acid sequence of SEQ ID NO:1 is made obvious by Silvennoinen" and that "[o]ne of ordinary skill in 
the art at the time of filing would be motivated to use the sequence taught by Silvennoinen et al. to 
design oligomers for use as primers to amplify and determine the level of mRNA encoding the murine 
Jak2 protein or to isolate other mRNAs encoding related proteins such as human Jak2 using 
hybridization or polymerase chain reaction methodology." (Final Office Action, page 7.) 

Appellants respectfully submit that the Examiner has mischaracterized Appellants' claims, and 
continues to fail to give proper consideration to the entire claims in making the rejection. 

Appellants' rejected claims are as follows: 

37. A method for detecting a target polynucleotide in a sample, said target 
polynucleotide having a sequence of a polynucleotide of claim 36, the method 
comprising: 

a) hybridizing the sample with a probe comprising at least 16 contiguous 
nucleotides comprising a sequence complementary to said target polynucleotide in the 
sample, and which probe specifically hybridizes to said target polynucleotide, under 
conditions whereby a hybridization complex is formed between said probe and said 
target polynucleotide or fragments thereof, and 

b) detecting the presence or absence of said hybridization complex, and, 
optionally, if present, the amount thereof. 

38. A method of claim 37, wherein the probe comprises at least 30 contiguous 
nucleotides. 

39. A method of claim 37, wherein the probe comprises at least 60 contiguous 
nucleotides. 

40. A method for detecting a target polynucleotide in a sample, said target 
polynucleotide having a sequence of a polynucleotide of claim 36, the method 
comprising: 

a) amplifying said target polynucleotide or fragment thereof using polymerase 
chain reaction amplification, and 

b) detecting the presence or absence of said amplified target polynucleotide or 
fragment thereof, and, optionally, if present, the amount thereof. 

Appellants note that in all four of these claims, drawn to methods of detecting specific 
polynucleotides, the preamble to the claim contains the limitation "said target polynucleotide having a 
sequence of a polynucleotide of claim 36." 

Appellants respectfully submit that the rejection fails to state a proper prima facie case of 
obviousness, and that the rejection should, therefore, be reversed. 



The Examiner has mischaracterized the claims 

First and foremost, this rejection is inapt because the Examiner has failed to cite any references 
which, either alone or in combination, would render obvious the claimed methods, which relate to 
methods of detecting a specific, particular sequence. No matter how obvious it might have been to try 
to detect the specific full length sequence claimed in Claims 37-40, even assuming, arguendo, that it 
might be obvious to try to detect an unknown full length sequence based on the existence of a gene 
encoding a mouse Jak2 kinase in the prior art, 

. . . "[o]bvious to try" has long been held not to constitute obviousness. In re O'Farrell, 
853 F.2d 894, 903, 7 USPQ2d 1673, 1680-81 (Fed. Cir. 1988). A general 
incentive does not make obvious a particular result, nor does the existence of 
techniques by which those efforts can be carried out. 

In re Deuel, 34 USPQ2d 1210 (CAFC 1995). 

The Examiner alleges that the method of detecting a polynucleotide of SEQ ID NO:1 is 
obvious, because a mouse Jak2 kinase gene (e.g., the cited Silvennoinen et al. reference) was 
identified. However, it is respectfully pointed out that the mouse Jak2 kinase gene was NOT identified 
as being a part of Appellants' claimed sequence of SEQ ID NO:1, which had not yet been elucidated. 
What might have been obvious is the wish to know the human homolog of the mouse Jak2 kinase gene, 
but the Silvennoinen reference does not disclose the human sequence. Moreover, while it is also true 
that the Silvennoinen mouse Jak2 kinase gene sequence (or, more correctly, the complement thereof) 
might have been useful to detect the human Jak2 kinase full length sequence, it was not so used by 
Silvennoinen, and in any case, the corresponding human Jak2 kinase full length sequence of SEQ ID 
NO:1 was not known until Appellants elucidated it. 

Appellants do not claim a method for detecting all polynucleotides encoding Jak2 kinases. 
Appellants claim a method for detecting the polynucleotides of Claim 36. The Examiner continues to 
improperly construe the claim language by failing to give weight to the limitation of the preamble "said 
target polynucleotide having a sequence of a polynucleotide of claim 36." 

As was discussed in Pitney Bowes Inc. v. Hewlett-Packard Co., 51 USPQ2d 1161 (Fed. 
Cir. 1999): 

If the claim preamble, when read in the context of the entire claim, recites limitations of 
the claim, or, if the claim preamble is "necessary to give life, meaning, and vitality" to 
the claim, then the claim preamble should be construed as if in the balance of the claim. 
Kropa v. Robie, 187 F.2d 150, 152, 88 USPQ 478, 480-81 (CCPA 1951); see 
also, 112 F.3d 473, 478, 42 USPQ2d 1550, 1553 (Fed. Cir. 1997); Corning Glass 
Works v. Sumitomo Elec. U.S.A., Inc., 868 F.2d 1251, 1257, 9 USPQ2d 1962, 
1966 (Fed. Cir. 1989). Indeed, when discussing the "claim" in such a circumstance, 
there is no meaningful distinction to be drawn between the claim preamble and the rest 
of the claim, for only together do they comprise the "claim". 

Thus, it is clear that the Examiner cannot disregard the limitation recited in the preamble, i.e., 
that the product detected is a specific sequence, and that sequence is not only novel, it is unobvious 
itself. The only imaginable process of detection claim that might be obvious over the cited prior art 
would be the wish to find a naturally-occurring sequence that is fully complementary to the complement 
of the Silvennoinen mouse Jak2 kinase gene, i.e., a method of detecting the Silvennoinen mouse Jak2 
kinase gene; that is not what Claims 37-40 encompass. 

Therefore, Appellants submit that the Examiner has clearly failed to establish a proper prima 
facie case of obviousness, as 
1) the Examiner has failed to construe the claims properly; and 

2) the skilled worker could only, at best, have hoped to detect the exact complement of a 
complement to the Silvennoinen mouse Jak2 kinase gene, not the full length sequence of 
SEQ ID NO:1. 

Thus, the cited art could not render the claimed methods obvious; since the entire sequence of 
SEQ ID NO: 1 was not known, there was no way that detecting it, as compared to the Silvennoinen 
mouse Jak2 kinase gene, could be obvious. 

Reversal of the rejections is respectfully submitted to be proper and necessary. 

Issue Five: Double Patenting Rejection 

Claims 30-36 were rejected under the judicially created doctrine of double patenting over 
Claims 1-3 of U.S. Patent No. 5,914,393, with the Examiner relying on the case of In re Schneller, 
158 USPQ 210 (CCPA 1968). While not conceding the propriety of the Examiner's position, 
Appellants are willing to submit a Terminal Disclaimer with respect to U.S. Patent No. 5,914,393 in 
the interest of expediting prosecution of the subject application. Therefore, it is requested that the 
Board indicate that the subject application will be allowable upon submission of such a Terminal 
Disclaimer. 

Issue Six: Obviousness-type Double Patenting Rejection 

Claims 37-40 were rejected under the judicially created doctrine of obviousness-type double 
patenting over Claims 1-3 of U.S. Patent No. 5,914,393. While not conceding the propriety of the 
Examiner's position, Appellants are willing to submit a Terminal Disclaimer with respect to U.S. Patent 
No. 5,914,393 in the interest of expediting prosecution of the subject application. Therefore, it is 
requested that the Board indicate that the subject application will be allowable upon submission of such 
a Terminal Disclaimer. 
(10) CONCLUSION 

Appellants respectfully submit that rejections for lack of utility based, inter alia, on an 
allegation of "lack of specificity," as set forth in the Office Action and as justified in the Revised Interim 
and final Utility Guidelines and Training Materials, are not supported in the law. Neither are they 
scientifically correct, nor supported by any evidence or sound scientific reasoning. These rejections are 
alleged to be founded on facts in court cases such as Brenner and Kirk, yet those facts are clearly 
distinguishable from the facts of the instant application, and indeed most if not all nucleotide and protein 
sequence applications. Nevertheless, the PTO is attempting to mold the facts and holdings of these 
prior cases, "like a nose of wax," to target rejections of claims to polypeptide and polynucleotide 
sequences where biological activity information has not been proven by laboratory experimentation, and 
they have done so by ignoring perfectly acceptable utilities fully disclosed in the specification as well as 
well-established utilities known to those of skill in the art. As is disclosed in the specification, and even 
more clearly, as one of ordinary skill in the art would understand, the claimed invention has 
well-established, specific, substantial and credible utilities. The enablement rejections are, therefore, 
improper and should be reversed. The written description, obviousness, and double patenting 
rejections should also be reversed. 

Moreover, to the extent the above rejections were based on the Revised Interim and final 
Examination Guidelines and Training Materials, those portions of the Guidelines and Training Materials 
that form the basis for the rejections should be determined to be inconsistent with the law. 

Due to the urgency of this matter, including its economic and public health implications, an 
expedited review of this appeal is earnestly solicited. 
If the USPTO determines that any additional fees are due, the Commissioner is hereby 
authorized to charge Deposit Account No. 09-0108. 
This brief is enclosed in triplicate. 

Respectfully submitted, 
INCYTE GENOMICS, INC. 



Date: 



Susan K. Sather 



3160 Porter Drive 
Palo Alto, California 94304 
Phone: (650) 855-0555 
Fax: (650) 849-8886 




Reg. No. 44,316 

Direct Dial Telephone: (650) 845-4646 
APPENDIX - CLAIMS ON APPEAL 



30. An isolated polynucleotide encoding a polypeptide selected from the group consisting of: 

a) an amino acid sequence of SEQ ID NO:2, and 

b) a fragment of an amino acid sequence of SEQ ID NO:2, wherein said fragment has kinase 
activity. 

31. An isolated polynucleotide of claim 30 which encodes a polypeptide comprising the amino 
acid sequence of SEQ ID NO:2. 

32. An isolated polynucleotide of claim 30 which encodes a polypeptide comprising a fragment 
of an amino acid sequence of SEQ ID NO:2, wherein said fragment has kinase activity. 

33. A recombinant polynucleotide comprising a promoter sequence operably linked to a 
polynucleotide of claim 30. 

34. A cell transformed with a recombinant polynucleotide of claim 33. 

35. A method for producing a polypeptide encoded by the polynucleotide of claim 30, the 
method comprising: 
a) culturing a cell under conditions suitable for expression of the polypeptide, wherein said cell 
is transformed with a recombinant polynucleotide, and said recombinant polynucleotide comprises a 
promoter sequence operably linked to a polynucleotide of claim 30, and 

b) recovering the polypeptide so expressed. 

36. An isolated polynucleotide comprising a polynucleotide sequence selected from the group 
consisting of: 

a) the polynucleotide sequence of SEQ ID NO:1, 

b) a naturally occurring polynucleotide sequence having greater than 92% sequence identity to 
the polynucleotide sequence of SEQ ID NO:1, 

c) a polynucleotide sequence complementary to a), 

d) a polynucleotide sequence complementary to b), and 

e) an RNA equivalent of a)-d). 

37. A method for detecting a target polynucleotide in a sample, said target polynucleotide 
having a sequence of a polynucleotide of claim 36, the method comprising: 

a) hybridizing the sample with a probe comprising at least 16 contiguous nucleotides 
comprising a sequence complementary to said target polynucleotide in the sample, and which probe 
specifically hybridizes to said target polynucleotide, under conditions whereby a hybridization complex 
is formed between said probe and said target polynucleotide or fragments thereof, and 
b) detecting the presence or absence of said hybridization complex, and, optionally, if present, 
the amount thereof. 

38. A method of claim 37, wherein the probe comprises at least 30 contiguous nucleotides. 

39. A method of claim 37, wherein the probe comprises at least 60 contiguous nucleotides. 

40. A method for detecting a target polynucleotide in a sample, said target polynucleotide 
having a sequence of a polynucleotide of claim 36, the method comprising: 

a) amplifying said target polynucleotide or fragment thereof using polymerase chain reaction 
amplification, and 

b) detecting the presence or absence of said amplified target polynucleotide or fragment 
thereof, and, optionally, if present, the amount thereof. 
Introduction 

It is now apparent that the development of all cancers and many non-[...] is accompanied 
by altered gene expression in the affected cell, compared to its normal state (Hunter 1991, 
Wynford-Thomas 1991, Vogelstein and Kinzler 1993, Semenza 1994, Cassidy 1995, Kleinjan and 
Van Heyningen 1998). Such changes also occur in response to external stimuli such as 
pathogenic micro-organisms ([...] 1998) and xenobiotics (Sewall et al. 1995, Dogra et al. 1998, 
Ramana and Kohli [...]), and during the development of undifferentiated cells (Hecht 1998, 
Rudin and Thompson 1998, Schneider-Maunoury et al. 1998). The potential medical and 
therapeutic benefits of understanding the molecular changes which [...] in progressing from 
the normal to the 'altered' state are enormous. Such profiling essentially provides a 
'fingerprint' of each step of a cell's development or response and should help in the 
elucidation of specific and sensitive biomarkers representing, for example, different types of 
cancer or previous exposure to certain classes of chemicals that are enzyme inducers. 

In drug metabolism, many of the xenobiotic-metabolizing enzymes (including the 
well-characterized isoforms of cytochrome P450) are inducible by drugs and [...] 1998), [...] 
proteins which may be crucial to the phenomenon of induction. Accordingly [...] and, 
therefore, of value in rationalizing the [...] induced toxicity. Knowledge of toxin-dependent 
gene regulation in target tissues is not solely an academic pursuit, as much interest has been 
generated in the pharmaceutical industry to harness this technology in the early identification 
of toxic drug candidates, thereby shortening the developmental process and contributing 
substantially to the safety assessment of new drugs. For example, if the [...] determined in 
the testis, then this profile would be representative of all new candidates which act via this 
specific molecular mechanism of toxicity, thereby providing a useful and coherent approach to 
the early detection of such toxicants. [...] informative to know the identity and functionality 
of all genes up/down regulated by such toxicants, this would appear a longer term goal [...]. 
[...] the current use of gene profiling yields a pattern of gene [...] of unknown toxicity which 
may be matched to that of well-characterized toxins, thus alerting the toxicologist to possible 
in vivo similarities between the unknown and the standard, thereby providing a platform for 
more extensive toxicological examination. Such approaches are beginning to [...] biotechnology 
[...] commercial products [...] that may be [...] for toxicity assessment of xenobiotics. These 
chips consist of hundreds/thousands of genes, some of which are [...] phenomenon. [...] chips 
are useful in broad-spectrum screen[ing] [...] in that gene arrays are now becoming more 
specific, e.g. chips for the identification of changes in growth factor [...] contribute to the 
aetiology [...] and chemical [...]. 

[...] documenting and explaining these genetic changes presents a [...] obstacle to 
understanding the [...] mechanisms of development and disease progression, the technology is 
now available to begin attempting this difficult challenge. Indeed, several 'differential 
expression analysis' methods have been developed which facilitate the identification of gene 
products that demonstrate 
Differential gene expression 

altered expression in cells of one population compared to another. These methods have been 
used to identify differential gene expression in many situations, including invading 
pathogenic microbes (Zhao et al. 1998), in cells responding [...], in chemically treated cells 
(Syed et al. 1997, Rockett et al. 1999), neoplastic cells (Liang et al. 1992, Chang and 
Terzaghi-Howe 1998) [...] 1998). Although differential expression analysis [...] cases, 
absolutely no prior knowledge of the specific genes which are up- or down-regulated is 
required. 

The field of differential expression [analysis] is a large and complex one, with [...] 
available to the potential user. These can be categorized into several methodological 
approaches, including: 

(1) Differential screening, 
(2) Subtractive hybridization (SH) (includes methods such as chemical cross-linking 
subtraction-CCLS, suppression-PCR subtractive hybridization-SSH, and representational 
difference analysis-RDA), 
(3) Differential display (DD), 
(4) Restriction endonuclease facilitated analysis (including serial analysis of gene 
expression-SAGE-and gene expression fingerprinting-GEF), 
(5) Gene expression arrays, and 
(6) Expressed sequence tag (EST) analysis. 

These approaches have been used successfully to isolate differentially expressed genes in 
different model systems. However, each method has its own [subtle] (and sometimes not so 
subtle) characteristics which incur various advantages and disadvantages. [...] highlight some 
of the broader considerations and implications of this very powerful and increasingly popular 
technique. Specifically, we will concentrate on so-called open systems, namely those which do 
not require any knowledge of [...] (the approach employed in this laboratory), it is the aim of 
the authors to highlight [...] those areas of common interest to those who use, or intend to 
use, differential gene expression analysis. 

Differential cDNA library screening (DS) 

[...] the development of multiple technological advances which have recently [...] gene 
expression profiling to the forefront of molecular analysis, recognition of the importance of 
differential gene expression and characterization of differentially expressed genes has 
existed for many years. One [...] [St John and] Davis (1979). These authors developed a 
method, termed 'differential plaque filter hybridization', which was used to isolate 
galactose-inducible DNA sequences from [...]. [...] multiple filter replicas are [...] probed 
with [...] (or otherwise) labelled [...] prepared from control and test cell mRNA populations. 
Those mRNAs which are differentially expressed in the treated cell population will [...]. 
Furthermore, labelled cDNA from different test conditions can be used to probe multiple blots, 
thereby enabling the identification of mRNAs which are only up-regulated under certain 
conditions. For example, St John and Davis (1979) screened replica filters with acetate-, 
glucose- and galactose-derived probes in order to obtain genes induced specifically by 
galactose metabolism. Although groundbreaking in its time, this method is now considered 
insensitive and time-consuming, as [...] expressed in [...] population. In addition, there is no 
convenient way to check that the procedure has worked until the whole process has been 
completed. 

Subtractive Hybridization (SH) 

The developing concept of differential gene expression and the success of early approaches 
such as that described by St John and Davis (1979) soon gave rise to a search for more 
convenient methods of analysis. One of the first to be developed was SH, numerous variations 
of which have since been reported (see below). In general, [...] from another (driver), 
followed by separation of the unhybridized tester fraction (differentially expressed) from the 
hybridized common [...]. This [...] achieved physically, chemically and through the use of 
selective polymerase chain reaction (PCR) techniques. 

Physical separation 

[...] subtractive hybridization technology involved the physical separation of hybridized 
common species from unique single stranded species. Several methods of achieving this have 
been described, including hydroxyapatite chromatography ([...] 1983), avidin-biotin [...] 
(Duguid and Dinauer 1990) and oligodT-latex separation (Hara et al. 1991). In the first 
approach, common mRNA species are removed by cDNA (from test cells)-mRNA (from control 
[cells]) subtractive hybridization followed by hydroxyapatite chromatography, as 
hydroxyapatite [...] adsorbs the cDNA-mRNA hybrids. The unabsorbed cDNA is then used either 
for the construction of a cDNA library of differentially expressed genes (Sargent and Dawid 
1983, Schneider et al. 1988) or directly as a probe to screen a preselected library ([...] 
1984). A schematic diagram of the procedure is shown in figure 1. 

[...] rigorous physical separation procedures coupled with sensitivity enhancing PCR steps 
were later developed as a means to overcome some of the problems [...]. [...] (1990) described 
a method of subtraction utilizing biotin-affinity systems as a means [...] common sequences. 
In [this] process, both the control and tester mRNA populations are first converted to cDNA 
and an adaptor ('oligovector') 
[...] the adaptor-containing [...] restriction endonuclease. This serves to cleave the [...] 
mRNA [...] control cDNA [...]. In order to further enrich those species differentially 
expressed in the tester cDNA, the subtracted tester population is amplified by PCR following 
every second subtraction cycle. After six cycles of subtraction (three reamplification 
steps) the reaction mix is ligated into a vector for further analysis. 

[Figure 1. Schematic of the hydroxyapatite method of subtractive hybridization. Panel labels: 
produce clones; label directly and probe library. Caption (partly legible): [...] cDNA derived 
from [...] fraction are removed by [...] the recovery of full length clones [...]; the 
remaining cDNAs are then cloned into [...] or used directly to probe a library, as described 
by Sargent and Dawid (1983).] 

[Figure 2. Schematic of oligo(dT)-latex subtractive hybridization. Panel labels: control 
(driver) mRNA; test (tester) mRNA; anneal mRNA to poly(dT) latex beads; centrifuge beads, 
collect and store supernatant; dissociate poly(A), reapply supernatant; tester-specific mRNA 
retrieved after 4 rounds of hybridization; cDNA synthesis; ligate adaptors and insert into 
vector; sequence inserts and/or carry out other downstream applications. Caption (partly 
legible): [...] oligonucleotide attached to latex beads. mRNA from [...] population of mRNA is 
tester specific and can be converted into cDNA for cloning and other downstream applications, 
as described by Hara et al. (1991).] 

In a slightly different approach, Hara et al. (1991) utilized a method whereby 
oligo(dT30) primers attached to a latex substrate are used to first capture mRNA 
extracted from the control population. Following 1st strand cDNA synthesis, the 
RNA strand of the heteroduplexes is removed by heat denaturation and 
centrifugation (the cDNA-oligotex-dT30 forms a pellet and the supernatant is removed). 
A quantity of tester mRNA is then repeatedly hybridized to the immobilized control 
(driver) cDNA (which is present in 20-fold excess). After several rounds of 
hybridization the only mRNA molecules left in the tester mRNA population are 
those which are not found in the driver cDNA-oligotex-dT30 population. These 
tester-specific mRNA species are then converted to cDNA and, following the 
addition of adaptor sequences, amplified by PCR. The PCR products are then 
ligated into a vector for further analysis using restriction sites incorporated into the 
PCR primers. A schematic illustration of this subtraction process is shown in figure 2. 

However, all these methods utilising physical separation have been described as 
inefficient due to the requirement for large starting amounts of mRNA, significant 
loss of material during the separation process and a need for several rounds of 
hybridization. Hence, new methods of differential expression analysis have recently 
been designed to eliminate these problems. 
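
The repeated-hybridization idea behind physical subtraction can be sketched numerically. The following Python fragment is only a toy model under stated assumptions: the species names are hypothetical placeholders, and the per-round capture efficiency of 0.9 is an arbitrary illustrative figure, not a value reported in the text. It shows why several rounds are needed before the tester population is dominated by tester-specific species.

```python
# Toy model of subtractive hybridization (illustrative assumptions only).
# Species present in the driver are depleted from the tester each round;
# species absent from the driver are left untouched.

def subtract_common(tester, driver, rounds=4, capture_efficiency=0.9):
    """Return tester abundances after repeated rounds of subtraction.

    tester, driver: dicts mapping a species name to its abundance.
    capture_efficiency: assumed fraction of common molecules removed
    per round (hypothetical value, not taken from the source).
    """
    remaining = dict(tester)
    for _ in range(rounds):
        for species in remaining:
            if species in driver:
                remaining[species] *= (1.0 - capture_efficiency)
    return remaining


# Hypothetical populations: gene_A is common, gene_B is tester-specific.
tester = {"gene_A": 1000.0, "gene_B": 50.0}
driver = {"gene_A": 1000.0}
result = subtract_common(tester, driver)
```

In this sketch the common species falls by four orders of magnitude over four rounds while the tester-specific species is unchanged; real recoveries would also reflect the material losses noted above, which the model ignores.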
Chemical Cross-Linking Subtraction (CCLS) 

In this technique, originally described by Hampson et al. (1992), driver mRNA 
is mixed with tester cDNA (1st strand only) in a ratio of > 20:1. The common 
sequences form cDNA:mRNA hybrids, leaving the tester specific species as single 
stranded cDNA. Instead of physically separating these hybrids, they are inactivated 
chemically using 2,5-diaziridinyl-1,4-benzoquinone (DZQ). Labelled probes are 
then synthesized from the remaining single stranded cDNA species (unreacted 
mRNA species remaining from the driver are not converted into probe material due 
to specificity of Sequenase T7 DNA polymerase used to make the probe) and used 
to screen a cDNA library made from the tester cell population. A schematic diagram 
of the system is shown in figure 3. 

It has been shown that the differentially expressed sequences can be enriched at 
least 300-fold with one round of subtraction (Hampson et al. 1992), and that the 
technique should allow isolation of cDNAs derived from transcripts that are present 
at less than 30 copies per cell. This equates to genes at the low end of intermediate 
abundance (see table 1). The main advantages of the CCLS approach are that it is 
rapid, technically simple and also produces fewer false positives than other 
differential expression analysis methods. However, like the physical separation 
protocols, a major drawback with CCLS is the large amount of starting material 
required (at least 10 μg RNA). Consequently, the technique has recently been 
refined so that a renewable source of RNA can be generated. The degenerate random 
oligonucleotide primed (DROP) adaptation (Hampson et al. 1996, Hampson and 
Hampson 1997) uses random hexanucleotide sequences to prime solid phase-synthesized 
cDNA. Since each primer includes a T7 polymerase promoter sequence 
[Figure 3. Schematic of chemical cross-linking subtraction. Panel labels: control (driver) 
mRNA; test (tester) mRNA; 1st strand cDNA synthesis followed by alkaline hydrolysis; mix and 
anneal; mRNA:cDNA hybrids; unique cDNA species; cross-linking agent (DZQ) added; hybrids are 
cross-linked; probes synthesised from single stranded cDNA species and used to probe cDNA 
library. Caption (partly legible): [...] mixed with 1st strand tester [...] are cross linked 
with 2,5-[...] sequences are differentially expressed in the tester population. Probes [...] 
using Sequenase 2.0 DNA polymerase, which [...] does not react with the [...] used to screen a 
cDNA library for clones of differentially expressed genes. Adapted from Walter et al. (1996), 
with permission.] 

Table 1. The abundance of mRNA species and classes in a typical mammalian cell. 

Class          Copies of each   Number of mRNA     Mean % of      Mean mass of each 
               species/cell     species in class   each species   species/total RNA 

Abundant       12000            4                  3.3            1.65 
Intermediate   300              500                0.08           0.04 
Rare           15               11000              0.004          0.002 

Modified from Bertioli et al. (1995). 

[...] antisense [...] the same DROP [...] driver can be [...] 

Representational Difference Analysis (RDA) 

RDA of cDNA (Hubank and Schatz 1994) is [...] of [a] technique originally applied to 
genomic DNA as a means of identifying [...] two complex genomes (Lisitsyn et al. 1993). [...] 
amplification involving subtractive hybridization of [...] in the presence of excess driver. 
Sequences in the tester [that are] homologous to the driver are rendered unamplifiable, 
whereas [those found] only in the tester retain the ability to be amplified [...] 
schematically in figure 4. 

In essence, [...] populations are first converted to cDNA and amplified by PCR [...] removed 
from both populations [...] (different) adaptor ligated to the amplified tester population 
only. Driver and tester [...] hybridized together [...]. [Following] hybridization, only 
tester:tester homohybrids have 5' adaptors at [each end of the] DNA duplex and can, thus, be 
filled in at both 3' ends. Hence, [...] amplified exponentially during the subsequent PCR 
step. [...] only amplify [linearly] [...]. [Single] stranded molecules are [...] 
PCR-enrichment of the tester:tester homohybrids. The adaptors [...] population [...] and the 
whole process repeated [a] further [...] of driver (Hubank and Schatz [...] ratio of 1:400 
[...] 80000 and 1:800000 for the second [...], respectively). Different adaptors are ligated 
to the tester [...] of hybridization and [...] subsequent amplifications. The final [...] gene 
products easily observed on an ethidium bromide gel. 

[...] reproducible [...] genes. Hubank and Schatz (1994) reported [...] differentially 
expressed in [...] substantially less [...] derived. Perhaps the main drawback is that 
multiple [...] hybridization, amplification and digestion are required. [...] therefore, 
lengthier than many other differential display [...] provides more opportunity for 
operator-induced error to occur. [...] linker capture subtraction (LCS) was described by Yang 
and Sytkowski (1996). 
[Figure 4. Schematic of representational difference analysis. Panel labels: ds control 
(driver) cDNA; test (tester) cDNA; digest with restriction enzyme; ligate to dephosphorylated 
12/24 adaptor strands; melt 12mer; fill in 3' ends (Taq), add primer and amplify; digest; 
digest and ligate new 12/24 adaptor; mix 100:1, melt and hybridize; fill in ends, add primer 
and amplify; linear amplification; exponential amplification; digest with mung bean nuclease; 
[...] molecules present after amplification; first difference product. Caption (partly 
legible): driver and tester cDNA are [...] set of 12/24 adaptor strands [...] products. The 
12mer is [...] polymerase. Each cDNA [...] adaptors is removed with DpnII. A second set of 
12/24 adaptors is [...] the amplified tester cDNA population, after which the tester is 
hybridized [...] of driver. The 12mer [...] out with primers identical [...]. Products which 
are exponentially amplified are those which [are] tester:tester [...]. Following PCR, ssDNA 
products are removed with mung bean nuclease [...] 'first difference product'. This is 
digested and a third set of 12/24 adaptors added [...] from the hybridization stage. The 
process is repeated [...] difference products, as described by Lisitsyn et al. (1993) and 
Hubank and Schatz (1994).] 

Suppression PCR Subtractive Hybridization (SSH) 

The most recent adaptation of the SH approach to differential [expression] analysis was 
first described by Diatchenko et al. (1996) [...] hybridization serves to enrich 
differentially expressed [...] combinations of the single stranded molecules present [...] 
competent cells [...] by no means [...] number of clones [...] identified [...] clones for 
characterization or DNA [...] array (see later) to quickly identify known [...] Wy-14,643 
(Rockett et al. unpublished observations). The isolation of differentially expressed genes 
[...] 



J. C. Rockett et al. 

[Figure. Schematic of suppression PCR subtractive hybridization. Panel labels: tester cDNA 
with adaptor 1; [driver cDNA] (in excess); tester cDNA with adaptor 2; mix samples, add fresh 
denatured driver, anneal; add primers and [amplify]; a, d: no amplification; b: no 
amplification, suppressed due to formation of panhandle structure; [c]: linear amplification; 
[e]: exponential amplification. Caption (partly legible): an excess of driver cDNA is [...] 
and allowed to hybridize [...] abundant molecules; and [...] not differentially expressed 
form type [...] molecules. [...] hybridization, the two primary hybridizations are mixed 
together [...]; fresh denatured driver can also be added [...] differentially expressed 
sequences. Type e molecules are formed in this secondary hybridization [...] amplified using 
two rounds of PCR. The final products [...] used directly or cloned into a vector for 
downstream manipulation. Adapted from Diatchenko et al. (1996) and Gurskaya [et al.] [...].] 

[Figure. Flow diagram of the SSH-based screening procedure. Panel labels: control animals; 
treated animals; extract mRNA from tissue of interest, e.g. liver; DNase treatment; convert 
to cDNA; complex probe for screening clones; hybridization, subtraction and amplification 
(control driving tester for up-regulated genes, tester driving control for down-regulated 
genes); run out products on agarose gel; extract individual bands and clone in T/A vector; 
screen using standard and HA agarose; PCR of 5-10 clone cultures per extracted band; 
different clones blotted and screened with up-regulated genes; different clones blotted and 
screened with down-regulated genes; plasmid mini-preps of selected clones; differentially 
expressed clones selected; sequencing and identification.] 

[...] inducer, phenobarbital [...] exposure to the [...] characterization of the toxic potential of new compounds by comparing [...] produced by [...]. [...] shows expression profiles obtained from a typical SSH [...]. Subsequent cloning of the individual bands [...] sequencing [...] reveals [...] which are [...] (tables 2 and 3).

One of the advantages of SSH is that [...] required of which specific genes [...] obtained. For example, [...] Wy-14,643, up-[...] in the rat (a sensitive species) and [...] in the guinea pig, a resistant species (Rockett, Swales, Esdaile and [...], unpublished observations), [...] down-regulated in the [...] (alternatively named TAPA-1) is [...] large number of cellular [...] differentiation (Levy et [...]) [...] in the phenomena [...] it is intriguing, and probably mechanistically [...] expression is differentially regulated in a resistant and a susceptible species. [...] of this approach is that the majority of genes [...] be sequenced and [...] database sequences, but [...] genes of completely [...] overall assessment of the [...] the lack of complete [...] gene profiling studies [...] xenobiotic challenge, [...] for further detailed [...].

[Figure caption: (...) treatment with Wy-14,643 or phenobarbital. (...) was used to generate the (...) (Clontech). Lanes: 1, 1 kb ladder; (...) genes down-regulated following Wy-14,643 treatment; 4, genes up-regulated following (...) treatment; (...) genes down-regulated following phenobarbital treatment; 5, genes (...); 6, 1 kb ladder. Reproduced from Rockett et (...) with permission.]

Differential Display (DD)

[...] primed PCR' (Liang [...]) [...] referred to as 'differential


Table 2. Genes up-regulated in rat liver following 3-day exposure to phenobarbital.

Band number           Highest sequence    FASTA-EMBL gene identification
(approximate          similarity
size in bp)

5 (1300)              93.5%               CYP2B1
7 (1000)              95.1%               Preproalbumin; serum albumin mRNA
8 (950)               98.3%               NCI_CGAP_Pr1 H. sapiens (EST)
10 (850)              95.7%               CYP2B1
11 (800)   Clone 1    94.9%               CYP2B1
           Clone 2    75.3%               CYP2B2
12 (750)              93.8%               TRPM-2 mRNA; sulfated glycoprotein
15 (600)              92.9%               Preproalbumin; serum albumin mRNA
16 (55[0]) Clone 1    95.2%               CYP2B1
           Clone 2    93.6%               Haptoglobulin mRNA, partial alpha
21 (350)              99.3%               18S, 5.8S and 28S rRNA

[...] by dot blot analysis and, therefore, [...]. [...] complete spectrum of genes which are up-regulated in rat liver by phenobarbital, but simply represents the genes sequenced and identified to date.



Table 3. Genes down-regulated in rat liver following 3-day exposure to phenobarbital.

Band number (approximate size in bp):
1 (1500); 2 (1200); 3 (1000); 7 (700); 8 (650); 9 (600); 10 (550); 11 (525); 12 (375); 13 (23[0]); 14 (170); 15 (140); Others: (300), (275)

Clone labels:
Clone 1, Clone 2, Clone 3; Clone 1, Clone 2; Clone 1, Clone 2; Clone 1, Clone 2, Clone 3

Highest sequence similarity:
95.3%; 92.3%; 91.7%; 77.2%; 94.5%; 91.0%; 86.9%; 96.2%; 86.9%; 82.0%; 73.3%; 93.7%; 100.0%; 97.2%; 100.0%; 100.0%; 96.0%; 97.3%; 96.7%; 93.1%

FASTA-EMBL gene identification:
3-oxoacyl-CoA thiolase
Hemopexin mRNA
Alpha-2u-globulin mRNA
M. musculus C1 inhibitor
Electron transfer flavoprotein
M. musculus Topoisomerase 1 (Topo 1)
Soares 2NbMT M. musculus (EST)
Alpha-2u-globulin (s-type) mRNA
Soares mouse NML M. musculus (EST)
Soares p3NMF19.5 M. musculus (EST)
Soares mouse NML M. musculus (EST)
NCI_CGAP_Pr1 H. sapiens (EST)
Ribosomal protein
Soares mouse embryo NbME13.5 (EST)
Fibrinogen B-beta-chain
Apolipoprotein E gene
Soares p3NMF19.5 M. musculus (EST)
Stratagene mouse testis (EST)
R. norvegicus RASP 1 mRNA
Soares mouse mammary gland (EST)

EST = expressed sequence tag. Bands [...] were shown to be false positives by dot blot analysis and [...]. Derived [...] et al. (1997). It should be noted that [...] do not represent the complete spectrum of genes which are down-regulated in rat liver by phenobarbital, but simply represent the genes sequenced and identified to date.



display' (DD). In this method, all the mRNA species in the control and treated cell populations are amplified in separate reactions using reverse transcriptase-PCR (RT-PCR). The products are then run side-by-side on sequencing gels. Those bands which are present in one display only, or which are much more intense in one [...] the speed with which it can be carried out [...] to obtain a display and as little as a week to make and identify clones.

Two commonly used variations are based on different methods of priming the [...] with a 2-base 'anchor' at the 3'-end, e.g. 5'-(dTn)CA-3' (Liang and Pardee 1992). Alternatively, an arbitrary primer may be used for 1st strand cDNA synthesis (Welsh et al. 1992). This variant of RNA fingerprinting [...] 'RAP' (RNA Arbitrarily Primed)-PCR. One advantage of this second approach is that PCR products may be derived from [...] open reading frames. In addition, it can be used for mRNAs [...] many bacterial mRNAs (Wong and McClelland [...]). [...] reverse transcription and denaturation, second strand cDNA synthesis is carried out with an arbitrary primer. [...] compared to random primers, which contain a mixture of [...] at each position). The resulting PCR thus produces [...] depending on the system (primer length and composition, polymerase [...]) [...] usually includes 50-100 [...]. [...] combination of different dT-anchors and arbitrary primers [...] mRNA species from a cell can be amplified. When the [...] populations are analysed side by side on a polyacrylamide gel, [...] can be identified and [...] analysis.

[...] today for identifying [...] perceived disadvantages:

(1) [...] mRNAs (Bertioli et al. [...]) [...] and the isolation of very [...] circumstances (Guimaraes et al. 1995a).

(2) [...] end of the mRNA [...] this may not always be the case (Guimaraes et al. 1995a). Since the [...] included in Genbank and shows variation between organisms, cDNAs [...] DD cannot always be [...].

(3) [...] display often cannot be [...] in up to 70% of cases (Sun et al. 1994). Some adaptations have been [...] to reduce false positives, including the use of two reverse [...] ([...] and Denman 1997), comparison of uninduced and induced [...] time course (Burn et al. 1994) and comparison of DDPCR products from two uninduced and two induced lines (Sompayrac et al. 1995). [...] reported that the use of cytoplasmic RNA rather than [...] positives arising from nuclear RNA that is not transported to the cytoplasm.

[...] weaknesses of the DD technique [...] (1996) and from articles by Liang et al. (1995) and [...].
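
The side-by-side comparison of control and treated displays can be pictured with a short sketch. This is an illustration only, not a procedure from the source: it assumes band intensities have already been quantified for each lane and keyed by approximate band size, and the function name `flag_differential_bands` and the cut-off `min_ratio` are hypothetical.

```python
def flag_differential_bands(control, treated, min_ratio=3.0):
    """Flag bands that differ between a control lane and a treated lane.

    control, treated: dicts mapping band size (bp) -> measured intensity.
    min_ratio: hypothetical cut-off for "much more intense in one lane".
    """
    flagged = {}
    for size in sorted(set(control) | set(treated)):
        c = control.get(size, 0.0)
        t = treated.get(size, 0.0)
        if c == 0.0 and t == 0.0:
            continue  # nothing measured in either lane
        if c == 0.0:
            flagged[size] = "treated only"
        elif t == 0.0:
            flagged[size] = "control only"
        elif t / c >= min_ratio:
            flagged[size] = "stronger in treated"
        elif c / t >= min_ratio:
            flagged[size] = "stronger in control"
    return flagged


# Made-up intensities for illustration.
control_lane = {350: 10.0, 600: 40.0, 950: 12.0}
treated_lane = {350: 11.0, 600: 5.0, 1300: 30.0}
print(flag_differential_bands(control_lane, treated_lane))
```

A ratio cut-off is only one possible rule; any flagged band would still need confirmation, since the text notes that patterns seen on a display often cannot be reproduced.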



Figure 8. Two approaches to differential display (DD) analysis. 1st strand synthesis can be carried out either with a polydT-NN primer (where N = G, C or A) or with an arbitrary primer. The use of different combinations of G, C and A to anchor the first strand polydT primer enables the priming of the majority of polyadenylated mRNAs. Arbitrary primers may hybridize at none, one or more places along the length of the mRNA, [...] one or more points in the same gene. In both cases, 2nd strand synthesis is carried out with an arbitrary primer. Since these arbitrary primers for the 2nd strand may also hybridize to the 1st strand cDNA in a number of different places, several [...] one binding point of the 1st strand primer. Following [...], the original set of primers is used to amplify the second strand products, with the result that numerous gene sequences are amplified.

[Figure panels: mRNA primed with (dTn)CA or with an arbitrary primer; 1st strand cDNA; denature and synthesise 2nd strand with any arbitrary primer; 2nd strand cDNA; cDNA can now be amplified by PCR using the original primer pair.]

Restriction endonuclease-facilitated analysis of gene expression 

Serial Analysis of Gene Expression (SAGE) 

[...] based on two principles. Firstly, in more than 95% of cases, short nucleotide sequences ('tags') of only nine or 10 base pairs provide [...]. [...] together in a series) of these tags allows sequencing of multiple cDNAs within a [...] enzyme [...].
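
The tag principle can be illustrated with a short sketch. It is an assumption-laden illustration rather than the published protocol: it takes the tag to be the bases immediately 3' of the last CATG site in each cDNA (CATG being the site drawn in the SAGE schematic), it uses a tag length of 10, and the function names and sequences are invented for the example.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # assumed anchoring-enzyme recognition site
TAG_LENGTH = 10       # assumed tag length in bases


def extract_tag(cdna, site=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the tag that follows the 3'-most anchoring site, or None."""
    seq = cdna.upper()
    pos = seq.rfind(site)
    if pos == -1:
        return None  # no anchoring site in this cDNA
    start = pos + len(site)
    tag = seq[start:start + tag_length]
    # Discard tags cut short by the end of the sequence.
    return tag if len(tag) == tag_length else None


def count_tags(cdnas):
    """Count how often each tag occurs in a collection of cDNA sequences."""
    tags = (extract_tag(seq) for seq in cdnas)
    return Counter(tag for tag in tags if tag is not None)


# Made-up sequences for illustration.
library = [
    "GGCATGAAACCCGGGTTTAAAA",
    "TTCATGAAACCCGGGTCCAAAA",
    "ACGTACGTACGT",
]
print(count_tags(library))
```

Comparing the tag counts obtained from two libraries would then give a rough measure of relative expression for each tag.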

Gene Expression Fingerprinting (GEF)

[...] approach for isolating differentially [...]. In this method, RNA is converted to cDNA using [...]. [...] cDNA population is digested with [...] endonuclease and captured with magnetic streptavidin microbeads [...] of the unwanted digestion products. The use of [...] cDNA fragment pool [...] RNA species is represented by not more than one restriction product. [...] subsequent amplification of the captured [...] adaptor-specific and one biotinylated [...]. The reamplified population is recaptured and the non-biotinylated strands [...] alkaline [...]. The non-biotinylated strand [...] different adaptor-specific primer in the presence of [...]. [...] immobilized 3' cDNA ends are next sequentially treated with a series of [...] restriction endonucleases and the products from each [...]. The result is a fingerprint composed of a number of ladders (equal to the number of sequential digests used). By comparing test versus control fingerprints, [...] to identify differentially expressed products which [...] gel and cloned. The advantages of this procedure are that it is [...] reproducible, and the authors estimate that [...] of cDNA molecules are involved in the final fingerprint. The disadvantage is [...] can rarely resolve more than 300-400 bands, which [...] the 1000 or more which are estimated to be produced in an average [...]. [...] use of 2-D gels such as those described by Uitterlinden et al. (1989) and Hatada et al. (1991) may help to overcome this problem.

[...] the fragments was later described [...]. [...] instead of sequential digestion of the immobilized [...], these authors simply compared the profiles of control and treated populations without further manipulation.



[Figure: schematic of SAGE. Steps shown: 1st strand cDNA synthesis using a biotinylated poly dT primer; cDNA cleaved with the anchoring enzyme (AE, CATG site) and captured with streptavidin beads; divide in half and ligate linkers; cleave with the tagging enzyme (TE) and produce blunt ends; ligate and amplify to give ditags flanked by AE sites; cleave with AE, isolate ditags, concatenate, clone and sequence (Tag 1, Tag 2, Tag 3, Tag 4 separated by CATG sites).]

DNA arrays

[...] that it takes a great deal [...] that they are indeed [...] tissue. Normally, the [...] RT-PCR. Even so, each of the aforementioned steps produces a bottleneck to the ultimate goal of rapid analysis of gene expression. These problems will [...] by the development of so-called DNA arrays (e.g. [...], Schena et al. 1996), the introduction of which has signalled [...] differential gene expression analysis. DNA arrays consist of a gridded membrane or glass 'chips' containing hundreds or thousands of DNA spots, each consisting of multiple copies of part of a known gene. The genes are often selected based on previously proven involvement in oncogenesis, cell cycling, [...] and other cellular processes. They are usually chosen to be [...] and animal species. Human and mouse arrays are already [...] and a few companies will construct a personalized array to order, for example Clontech Laboratories and Research Genetics Inc. The technique [...] hundreds or even thousands of genes can be spotted on a single [...], and mRNA/cDNA from the test populations can be labelled and used directly [...]. When analysed with appropriate hardware and software, arrays offer a rapid and quantitative means to assess differences in gene expression [...]. Of course, there [...] can only be identified [...] which are in the array (hence the term 'closed' system). Therefore, [...] approach to elucidating the molecular mechanisms [...] development system may be to combine an open and a closed system: a DNA array to directly identify and quantitate the expression of [...] in mRNA populations, and an open system such as SSH to isolate [...] genes which are differentially expressed.

One of the main [...] number of gene fragments which can be put on a [...]. [...] companies have reported gridding up to 60000 spots on a single glass 'chip'. [...] these high density chip-based micro-arrays will [...] mass-produced off-the-shelf items in the near future. This should [...] rapid determination of differential expression in [...] experiments. Aside from their high cost and the technical complexities involved in [...] and probing DNA arrays, the main problem [...], especially with the newer micro-array (gene-chip) technologies, is that results are [...] reproducible between arrays. However, this problem is being addressed and should be resolved within the next few years.

[...] expressed genes

[...] clones obtained from cDNA libraries. Even though [...] identity (putative identification is the best to be hoped for), they have [...] a rapid and efficient means of discovering new genes and [...] profiles of gene [...] Adams et al. (1991), [...] estimated that there are [...] representing over half



of all human genes (Hillier et al. 1996). This large number of freely available sequences (both sequence information and clones are normally available royalty-free [...]) has enabled the development of a new approach towards differential gene expression analysis, as described by Vasmatzis et al. (1998). The [...]: EST databases are first searched for genes that have a number of related EST sequences from the target tissue of choice, but none or few from non-target tissue libraries. Programmes to assist in the assembly of such sets of [...] developed privately or [...] from the [...]. For example, the Institute for Genomic Research (TIGR, found at [...]) provides many software tools free of charge to the scientific community. [...] the assembly of large sets of overlapping data such as ESTs, bacterial artificial chromosomes (BACs), or small genomes. Candidate EST clones representing [...] genes are then analysed using RNA blot methods for size and tissue specificity [...] as probes to isolate and [...] the full length cDNA clone for further characterization. In practice, however, the method is rather [...] bioinformatic computer analysis coupled with confirmatory molecular studies. Vasmatzis et al. (1998) have described several [...] approach, such as separating highly homologous [...] and an overemphasis of specificity for some EST sequences. However, since these problems will largely be addressed by the development of more suitable computer algorithms and an increased completion [...], it is likely that this approach to identifying differentially expressed genes may enjoy more patronage in the future.
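
The database search step can be sketched as a simple filter. This is a minimal sketch under stated assumptions: EST hits are assumed to have been tallied beforehand per gene cluster and per source library, and the function name, the thresholds `min_target` and `max_other`, and the example counts are hypothetical rather than values taken from Vasmatzis et al.

```python
def candidate_clusters(est_counts, target, min_target=3, max_other=1):
    """Select clusters with many ESTs in the target tissue and few elsewhere.

    est_counts: dict mapping cluster name -> {tissue library: EST count}.
    target: name of the tissue library of interest.
    """
    selected = []
    for cluster, by_tissue in est_counts.items():
        in_target = by_tissue.get(target, 0)
        elsewhere = sum(n for tissue, n in by_tissue.items() if tissue != target)
        if in_target >= min_target and elsewhere <= max_other:
            selected.append(cluster)
    return sorted(selected)


# Made-up counts for illustration.
counts = {
    "cluster_A": {"prostate": 12, "liver": 0},
    "cluster_B": {"prostate": 9, "liver": 7, "brain": 4},
    "cluster_C": {"prostate": 1},
}
print(candidate_clusters(counts, target="prostate"))
```

Clusters passing such a filter are only candidates; as the text notes, they still require confirmation by RNA blot analysis for size and tissue specificity.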



Problems and potential of differential expression techniques 

The holistic or single cell approach?

When working with in vivo models of differential expression, one of the first issues to consider must be the presence of multiple cell types in any given specimen. For example, a liver sample is likely to contain not only hepatocytes, but also (potentially) Ito cells, bile ductule cells, endothelial cells, various immune cells (e.g. lymphocytes, macrophages and Kupffer cells) and fibroblasts. Other tissues will [...] normal, hyperplastic and/or dysplastic cells present in a sample. One must, therefore, be aware that genes obtained from a differential display experiment performed on an animal tissue model may not necessarily arise exclusively from the intended 'target' cells, e.g. hepatocytes/neoplastic cells. [...] immunohistochemistry [...] RT-PCR should be used to confirm which cell types are expressing the [...] differential expression of genes in the development of different cell types which [...] cell populations. The problem [...] addressed at the National Cancer Institute (Bethesda, MD, USA), where new micro-dissection techniques have been employed to assist in their gene analysis [...] Project (CGAP) (for more information see web site http://www.ncbi.nlm.nih.gov/ncicgap/intro.html). There are also separation techniques available that utilise cell-specific antigens as a means to isolate target cells ([...] 1998, Kas-Deelen et [...], Rogler et al. 1998).

However, those taking a holistic approach [...] issue unimportant. There is an equally [...] showing altered expression within a compromised tissue should [...] into consideration. After all, since all tissues are complex mixes of different [...] types which intimately regulate each [...], each cell type could in some way contribute (positively or negatively) [...] molecular mechanisms which lie behind responses to external stimuli [...] growth. It is perhaps [...] informative [...] differential display experiments using in vivo as opposed to in vitro models, where [...] of identical cells probably represent a partial, skewed or even inaccurate picture of the molecular changes that [...].

[...] biological variation

[...] models are being used. It is clear that individuals (human and animal) [...] different ways to identical stimuli. One of the best characterized [...] debrisoquine oxidation polymorphism, which is mediated by [...] CYP2D6 and determines the [...] drugs (Lennard 1993, Meyer and Zanger 1997). The reasons for such [...] and complex, but allelic variations, regulatory region polymorphisms [...] physical and mental health can all contribute to observed differences in individual responses. Careful thought should, therefore, be given to the specific objectives [...] study and to the possible value of pooling starting material [...]. [...] effect of this can be beneficial through the ironing out of exaggerated [...] and unimportant minor fluctuations of (mechanistically) [...] individual animals, thus providing a clearer overall picture of the [...] mechanisms of the response. However, at the same time, such [...] variations may be of utmost importance in deciding the [...] succumb to or resist the effects of a given chemical/disease.

[...] percentage of [...]

[...] suggesting that mammalian [...] species at any one time (Mechler and Rabbitts 1981, [...]), although figures as high as 20-30000 have also been [...]. Hedrick et al. (1984) provided evidence suggesting that the majority of [...] to the rare abundance class. A breakdown of this abundance [...] in table 1.

When the results of differential [...] been compared with data obtained previously using other methods, [...] not all differentially expressed mRNAs are represented [...]. In particular, rare messages (which, importantly, often include regulatory [...]) are not easily recovered using differential display systems. This [...], as the majority of mRNA species exist at [...] population (table 1). Bertioli et al. (1995) examined the efficiency of [...] templates (heterogeneous mRNA populations) for recovering rare messages and were unable to detect mRNA species present at less than 1.2% of the total mRNA population [...].

up to lOOOo r^rtZTl > ^ S P eC,es P roducedb >«>K'ven mammalian cell. 
White dtf. ma7£T f o»°*«ng chem.cal snmular.on. 

have also prodded an apparent paucity of differentially expressed «nes Usin* SH 
tor examp e. Cao et al (19<)7\ f n „~A 1 1 j -a- H genes, using SH 

Q genes covered and the percentage that are true 
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one point in time. It „„ v ^ b h » «P«nment cm o„|j. in Mtt(> • " 

«U be i,. ,„ '~°^" d ">et«f„«. vi.al inform.,™ 

model system bef. reh » d „ p„^n^ ^ " ™ Uch ,n,c> ™ ati '» «bou,X 
■"WW »pec,„c » poj™ """ « stretep- c „ be derive}™ 

» conduct the experiment over ™„,„b,. i b '"" »"» Point tuwj 

^""'y » „.,„ of wor a k "^id ™ ur " whi < h - ° f £=' 

whose e,p,e,s,o„ „ ch „ d „ , w "ffal/i L* h •'«««• 



I 



ally effective — proving 
numbers of artificial 
e rare messages already 
models will genuinely 
s. In addition, there are 
pie, mRNAs may have 
amplification by PCR- 
circumstances not all 
evelopment. deadeny]- 
i Steitz 1998). whilst 
Hsp/O (and perhaps, 
:avalle«fl/. 1994). The 
e efficient* of systems 
cy of any system also 
tial display techniques 
:> isolate mRNA that is 
are used to prime first 
:ibed to some degree 
It has been shown, at 
can lead to inefficient 
Subtraction kit user 
o likewise in other SH 
:tion amplification step 
me sequences amplify 

the temporal factor. It 
ily interrogate a cell at 
genes showing altered 
disease processes and 
jscades of signalling, 
ies which are switched 
vital information may 
inrormation about the 
?y can be derived for 
xuiar interest to the 
r time point analysis is 
nich, of course, adds 



:ssue of how large the 
gene in question with 
:he isolation of genes 
reported using SSH 
mstratmg a change in 
here is a 'grey zone' — 
of isolation between 
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^^Tj^l^"^' ttrt »<>l»P •!>•• » 1.1 .mem could b. 

s&ssr see sir wit- L,chfie,d - ™ £ = 
assess TMrss^T " ! *-r ""^ 

c»,«. AnantiJc 1«9/. personal communication » -ince both 

DNA clone' "S ' ni ; ' he,tf0 "- """"" »«■•«'»"« '""^ ™« J ^ 
will *ffJ,J I a, " er L m AT /GC content, inclusion of a H.Vdve in the tel 

sSSSS????"- 

or di*«r^ f | M u ^ P ^^'". -* small-amount of reamplified 

[...] or digested clone can be run on a standard high resolution gel, and a second [...] in a similar gel containing one of the HA-[...]. The standard gel should indicate any gross size differences, whilst the HA-[...] should separate otherwise unresolvable species (on standard agarose gels) [...] their base content. Geisinger et al. (1997) reported [...] identifying DD-derived clones. Figure 10 shows such an experiment [...] laboratory on clones obtained from a band [...].

Figure 10. [...] using HA-red. Bands of decreasing [...] subtractive hybridization experiment and cloned. Seven colonies were picked [...] each cloned band and their [...] (A) a high resolution 2% agarose gel, and (B) a high resolution 2% agarose gel containing [...] HA-red. With few exceptions, all the clones from each [...]. However, the presence of HA-red (gel B), which separates identically-sized DNA fragments based on the percentage of GC within the sequence, clearly indicates the presence of [...] species within each band. For example, even though all five re-amplified clones of band [...] appear to be the same size, at least four different gene species are represented.

An alternative approach is to carry out a 2-D analysis of the differential display products. In this approach, [...] carried out in a standard agarose gel. The gel slice [...] and incorporated [...] an HA gel for resolution based on [...]. [...] there being different gene species [...] GC/AT content. However, even these species are not unresolvable. [...] again, one might use [...] (DGGE) or temperature gradient field electrophoresis [...] the contents of a band, either directly on the extracted band (Suzuki et al. 1991) or on the reamplified product.

[...] techniques to visualize large [...] a problem in that, in terms of numbers, the resolution of PAGE [...] bands. One approach to overcome this might be to [...] described by Uitterlinden et al. (1989) and Hatada et al. (1991).



Extraction of differentially expressed bands from a gel can be complex since in some cases (e.g. DD, GEF) the results are visualized by autoradiographic means, such that precise overlay of the developed film on the gel must occur if the correct band is to be extracted for further analysis. Clearly, a misjudged extraction can account for many man-hours lost. This problem, and that of the use of radioisotopes, has been addressed by several groups. For example, Lohmann et al. (1995) [...] can be used directly to visualize DD bands [...] horizontal PAGE. An et al. (1996) avoided the use of radioisotopes by transferring a small amount (20-30%) of the DNA from their DD to a nylon membrane and visualizing the bands using chemiluminescent staining before going back to extract the remaining DNA from the gel. Chen and Peck (1996) went one step further and transferred the entire DD to a nylon membrane. The DNA bands were then visualized using a digoxigenin (DIG) system (DIG was attached to the polydT primers used in the differential display procedure). Differentially expressed bands [...].

One of the advantages of using techniques such as SSH and RDA is that the final display can be run on an agarose gel and the bands visualized with simple ethidium bromide [...] provide acceptable [...], overstaining with SYBR Green I or SYBR Gold nucleic acid stains (FMC) effectively enhances [...] of the bands. [...] greatly aids in their precise extraction and often reveals some faint products that may otherwise be overlooked. Whilst differential displays stained with SYBR Green I are better visualized using short wavelength UV (254 nm) rather than medium wavelength (306 nm), the shorter wavelength is much more DNA damaging. In practice, it takes only a few seconds to damage DNA extracted under 254 nm irradiation, effectively preventing reamplification and cloning. The best approach is to overstain with SYBR Green I and extract bands under medium wavelength UV transillumination.

The possible use of 'microfingerprinting' to reduce complexity

Given the sheer number of gene products and the possible complexity of each band, an alternative approach to rapid characterization may be to use an enhanced analysis of a small section of a differential display: a 'sub-fingerprint' or 'micro-fingerprint'. In this case, one could concentrate on those bands which only appear in a particular chosen size region. Reducing the fingerprint in this way has at least two advantages. One is that it should be possible to use different gel types, concentrations and run times tailored exactly to that region. Currently, one might run products from 100-3000 bp on the same gel, which leads to compromise in the gel system being used and consequently to suboptimal resolution, both in terms of size and numbers, and can lead to problems in the accurate excision of individual bands. Secondly, it may be possible to enhance resolution by using a 2-D analysis using a HA-stain, as described earlier. In summary, if a range of gene product sizes is carefully chosen to include certain 'relevant' genes, the 2-D system standardized and appropriate gene analysis used, it may be possible to develop a method for the early and rapid identification of compounds which have similar or widely different cellular effects. If the prognosis for exposure to one or more other chemicals which [...] then one could perhaps predict similar effects for any new compounds which show a similar micro-fingerprint.
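
One way to picture the comparison of micro-fingerprints is as a similarity score over the bands that fall inside the chosen size window. The sketch below is illustrative only: the Jaccard index, the 300-800 bp window, the 10 bp binning tolerance and the function names are all assumptions introduced here, not part of the source.

```python
def micro_fingerprint(band_sizes, low, high, tolerance=10):
    """Reduce a display to binned band positions within [low, high] bp.

    Bands are grouped into bins of `tolerance` bp so that small sizing
    differences between gels do not count as different bands.
    """
    return {int(size) // tolerance for size in band_sizes if low <= size <= high}


def similarity(bands_a, bands_b, low=300, high=800, tolerance=10):
    """Jaccard similarity of two micro-fingerprints (1.0 means identical)."""
    a = micro_fingerprint(bands_a, low, high, tolerance)
    b = micro_fingerprint(bands_b, low, high, tolerance)
    if not a and not b:
        return 1.0  # two empty windows are treated as identical
    return len(a & b) / len(a | b)


# Made-up band sizes (bp) for a reference compound and a new compound.
reference = [350, 525, 600, 1200]
new_compound = [352, 600, 750, 90]
print(similarity(reference, new_compound))
```

Fixed bins can split two nearly identical bands that straddle a bin boundary, so a real comparison would need a more careful matching rule.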
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Screening 

False positives 



[...] at length amongst the [...] ([...] Nishio et al. 1994, Sun et al. 1994, Sompayrac et al. 1995). The [...] false positives varies with the technique being used. For instance, in RDA, [...] adaptors which have not [...] false positives through illegitimate ligation events (O'Neill and Sinclair [...]) [...] they can arise through [...]. [...] false positives appear to be derived largely from abundant gene species, although some may arise from cDNA/mRNA species which do not [...] hybridization for technical reasons.

A quick screening of putative differentially [...] may be carried out using a simple dot blot approach, in which [...] probes synthesized from tester and driver mRNA are [...] of said clones (Hedrick et al. 1984, Sakaguchi et al. 1986). [...] will hybridize to tester probe, but not driver. [...] approach is that rare [...] may not generate detectable hybridization [...]. [...] option for those using SSH is to screen the clones using [...] probe generated from the subtracted cDNA from which it was derived and with [...] reaction (Clontechniques [...]). [...] method enriches rare sequences, it should be possible to confirm the presence of [...] genes. Despite this quick [...] need to go back to the original mRNA and [...] using a more quantitative approach. Although this may be [...] blots, the sensitivity is [...] methods for accurate and sensitive determinations (see below).

Sequence analysis

The majority of differential display procedures [...] final products which are between 100 and 1000 bp in size. However, [...] the sequence for analysis of the DNA [...] reduced confidence in the result [...] DNA sequences are either identical [...]



mine altered expression 
R primers and /or post- 
* receptors, cell cycling 
onsidered as candidates 
arrays (e.g. Clontech's 
:his to some degree by 
.poptosis, stress. DNA- 



[...] regulated in liver of rats exposed to Wy-14,643 and was identified
[...] transferrin [...] data not shown). However, [...]

[...] can prevent the formation of appropriate hybrids, especially at the
concentrations required for efficient hybridization.

[...] fragments provides [...]

[...] some fragments from differentially
expressed [...] eliminated during subtractive hybridization procedures.
However, other fragments may be enriched and isolated. As a
consequence of this, some genes will be cut one or more times, giving rise to two
[...]

Sequence comparisons also threw up another [...]
of sequence [...]; does one accept a result at 90 [...]

Quantitative analysis

[...] to the quantitative analysis of the
[...] means of confirming that they are truly differentially
expressed [...]

[...] method of choice for confirming differential expression [...]
somewhat more complex than Northern analysis, requiring synthesis of primers and
optimization of reaction conditions for each gene species [...]
high throughput PCR systems [...]
money and time needed to develop [...]
especially when one might [...]. The
use of semi-quantitative [...]
must first of all choose an internal standard [...]
[...] not change in the test cells
[...] for example interferon-gamma (IFN-γ, Frye [...]),
glyceraldehyde-3-phosphate [...],
dihydrofolate reductase (DHFR, Mohler and [...]), β2-microglobulin
(β2-m, Murphy et al. 1990), hypoxanthine [...] (HPRT, Foss et
al. 1998) and a number [...]. [...]
standard should not change its level [...] regardless of
stage in the cell cycle or through the [...]
shown on numerous occasions [...] housekeeping genes currently
used by the research community do in fact [...] conditions and in
different tissues (ClonTechniques 1997b [...]). [...] therefore [...]
preliminary experiments [...] to establish
their suitability for use in the model system.

[...] treated with caution. By
[...] expression, one can perhaps
gain insight into why two different [...] ways to the same stimuli.
For example, rats and mice appear sensitive to the non-genotoxic effects of a wide
range of peroxisome proliferators whilst Syrian hamsters and guinea pigs are largely
resistant (Orton et al. 1984, Rodricks and Turnbull 1987, Lake et al. 1989, 1993,
Makowska et al. 1992). A simplified [...]
compare lists of up- [...] identify those which are
expressed in only one species [...].
[...] the said genes might suggest [...] carcinogenesis
or protection. Of course, the situation is likely to be far more complex. Perhaps if
there were one key gene protecting guinea pig [...] from non-genotoxic [...] and it was
upregulated 50 times by PPs, the same [...] might only be up-regulated [...]
in the rat. However, since both were noted to be upregulated, the importance of the
gene may be overlooked. Just to complicate [...] what is the
true relevance of gene Y which [...] after a particular treatment,
and gene Z which shows only a [...] examines the literature one
may find that historically, gene Y has often [...] 40-60-
fold by a number of [...]
appear less significant. However, [...] show that gene Z has never been
recorded as having more than [...] makes your 5-fold
increase all the more [...]
increase has only been seen [...] or following treatment with related
chemicals.

Problems in using the differential display [...]

[...] development process or [...]
become clear that the fingerprinting process, whilst still valid, is much too complex
to be represented by a single technique profile. This is because all differential display
[...] contain [...] unique technical problems which preclude the
isolation and identification of all those genes which show changes in expression.
[...] are important genetic changes related to [...]
which differential expression analysis is simply not designed to address. An example
[...] of small deletions, insertions, [...]
seen in [...] tumour suppressor genes and individual
polymorphisms. Polymorphic variations, small though they usually are, are also
[...] important in explaining why some patients
respond better than others to certain drug treatments (and, in logical extension, [...]
others). The identification of such point mutations and naturally occurring
[...] subsequent application of sequencing, SSCP, DGGE
[...] gene of interest. Furthermore, differential display is [...] designed
to address [...] gene species or whether it [...]
is a result of increased transcription or increased mRNA
stability.



Conclusions

The [...] advantage of open differential display techniques is that
they are not limited by extant theories or researcher bias in revealing genes which are
differentially expressed, since they are designed to amplify all genes which
demonstrate altered expression. This means that they are useful for [...] of
previously unknown genes which may turn out to be useful biomarkers of a particular
[...] condition. At least one open system (SAGE) is also quantitative, thus
eliminating the need to return to the original mRNA and carry out Northern/PCR
analysis to confirm the result. However, the rapid progress of genome [...]
will switch from open to closed differential display systems, particularly DNA
arrays. Arrays are easier and faster to prepare and use, provide quantitative data, are
suitable for high throughput analysis and can be tailored to look at specific signalling
[...] in humans and
common laboratory animals combined with improved DNA array technology
means that it will soon no longer be necessary to try to isolate differentially expressed
genes using the technically more demanding open system approach. Thus, their
[...] be largely eradicated [...]
likely, therefore, their sphere of application will be reduced to analysis of the
less common laboratory species, since it will be some time yet before the genomes of
[...] be sequenced [...].

Of course, in the end the question will always remain: What is the functional/
biological significance of the identified, differentially expressed genes? One
persistent problem is understanding whether differentially expressed genes are a
[...] or consequence [...]. Furthermore, many chemicals, such as
[...] replication will also be upregulated but may have little or nothing to do with the
carcinogenic effect. Whilst differential display technology cannot hope to [...]

[...] photograph [...] impossible [...]. In order to understand the battle [...]
[...] with other techniques [...] knockout technology [...]
[...] time and dose response [...]
[...]
knowledge contributes [...] to the understanding of human disease [...].

The availability of genome-scale DNA sequence information and reagents has radically altered life-science
research. This revolution has led to the development of a new scientific subdiscipline derived from a combination
of the fields of toxicology and genomics. This subdiscipline, termed toxicogenomics, is concerned with the
identification of potential human and environmental toxicants, and their putative mechanisms of action,
through the use of genomics resources. One such resource is the cDNA microarray or "chip," which allows the
monitoring of the expression levels of thousands of genes simultaneously. Here we propose a general method by
which gene expression, as measured by cDNA microarrays, can be used as a highly sensitive and informative
marker for toxicity. Our purpose is to acquaint the reader with the development and current state of microarray
technology and to present our view of the usefulness of microarrays to the field of toxicology. Mol. Carcinog.
24:153-159, 1999. © 1999 Wiley-Liss, Inc.

INTRODUCTION
Technological advancements combined with in- 
tensive DNA sequencing efforts have generated an 
enormous database of sequence information over the 
past decade. To date, more than 3 million sequences 
totaling over 2.2 billion bases [1], are contained 
within the GenBank database, which includes the 
complete sequences of 19 different organisms [2]. The
first complete sequence of a free-living organism,
Haemophilus influenzae, was reported in 1995 [3] and
was followed shortly thereafter by the first complete
sequence of a eukaryote, Saccharomyces cerevisiae [4].
The development of dramatically improved sequenc-
ing methodologies promises that complete elucida-
tion of the Homo sapiens DNA sequence is not far
behind [5].

To exploit more fully the wealth of new sequence
information, it was necessary to develop novel meth-
ods for the high-throughput or parallel monitoring
of gene expression. Established methods such as
northern blotting, RNAse protection assays, S1 nu-
clease analysis, plaque hybridization, and slot blots
do not provide sufficient throughput to effectively
utilize the new genomics resources. Newer methods
such as differential display [6], high-density filter
hybridization [7,8], serial analysis of gene expression
[9], and cDNA- and oligonucleotide-based microarray
"chip" hybridization [10-12] are possible solutions
to this bottleneck. It is our belief that the microarray
approach, which allows the monitoring of expres-
sion levels of thousands of genes simultaneously, is
a tool of unprecedented power for use in toxicology
studies.



Almost without exception, gene expression is al- 
tered during toxicity, as either a direct or indirect 
result of toxicant exposure. The challenge facing 
toxicologists is to define, under a given set of ex-
perimental conditions, the characteristic and spe-
cific pattern of gene expression elicited by a given
toxicant. Microarray technology offers an ideal plat-
form for this type of analysis and could be the foun-
dation for a fundamentally new approach to
toxicology testing. 

MICROARRAY DEVELOPMENT AND APPLICATIONS 
cDNA Microarrays 

In the past several years, numerous systems were
developed for the construction of large-scale DNA
arrays. All of these platforms are based on cDNAs
or oligonucleotides immobilized to a solid sup-
port. In the cDNA approach, cDNA (or genomic)
clones of interest are arrayed in a multi-well for-
mat and amplified by polymerase chain reaction.
The products of this amplification, which are usu-
ally 500- to 2000-bp clones from the 3' regions of
the genes of interest, are then spotted onto solid
support by using high-speed robotics. By using
this method, microarrays of up to 10 000 clones
can be generated by spotting onto a glass substrate
[13,14]. Sample detection for microarrays on glass
involves the use of probes labeled with fluores-
cent or radioactive nucleotides.

Fluorescent cDNA probes are generated from con- 
trol and test RNA samples in single-round reverse-tran-
scription reactions in the presence of fluorescently
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which
produces control and test products labeled with dif-
ferent fluors. The cDNAs generated from these two
populations, collectively termed the "probe," are then
mixed and hybridized to the array under a glass cov-
erslip [10,11,15]. The fluorescent signal is detected
by using a custom-designed scanning confocal mi-
croscope equipped with a motorized stage and lasers
for fluor excitation [10,11,15]. The data are analyzed
with custom digital image analysis software that de-
termines for each DNA feature the ratio of fluor 1 to
fluor 2, corrected for local background [16,17]. The
strength of this approach lies in the ability to label 
RNAs from control and treated samples with differ- 
ent fluorescent nucleotides, allowing for the simul- 
taneous hybridization and detection of both 
populations on one microarray. This method elimi- 
nates the need to control for hybridization between 
arrays. The research groups of Drs. Patrick Brown and 
Ron Davis at Stanford University spearheaded the 
effort to develop this approach, which has been suc- 
cessfully applied to studies of Arabidopsis thaliana 
RNA [10], yeast genomic DNA [15], tumorigenic ver-
sus non-tumorigenic human tumor cell lines [11],
human T-cells [18], yeast RNA [19], and human in-
flammatory disease-related genes [20]. The most dra-
matic result of this effort was the first published
account of gene expression of an entire genome, that
of the yeast Saccharomyces cerevisiae [21].
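
The per-feature ratio described above (fluor 1 to fluor 2, corrected for local background) can be sketched in a few lines of Python. This is only a minimal illustration and not the custom image-analysis software cited in the text: the function name `log2_ratio`, the `floor` guard against non-positive signals, the choice of the treated channel as numerator, and the intensity values are all assumptions made for the example.

```python
import math


def log2_ratio(test_fg, test_bg, ctrl_fg, ctrl_bg, floor=1.0):
    """Return log2(test / control) for one array feature.

    Each channel's local background is subtracted from its foreground
    intensity first. `floor` (an assumption of this sketch) keeps a
    background-subtracted signal from reaching zero or going negative.
    """
    test = max(test_fg - test_bg, floor)
    ctrl = max(ctrl_fg - ctrl_bg, floor)
    return math.log2(test / ctrl)


# Hypothetical intensities for a single spot:
# (1700 - 100) / (500 - 100) = 4, i.e. a 4-fold induction.
print(log2_ratio(test_fg=1700.0, test_bg=100.0, ctrl_fg=500.0, ctrl_bg=100.0))  # 2.0
```

Working on a log2 scale is a common convention rather than something the text prescribes; it makes a 4-fold induction (+2) and a 4-fold suppression (-2) symmetric around zero.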

In an alternative approach, large numbers of cDNA 
clones can be spotted onto a membrane support, al-
beit at a lower density [7,22]. This method is useful
for expression profiling and large-scale screening and
mapping of genomic or cDNA clones [7,22-24]. In
expression profiling on filter membranes, two dif-
ferent membranes are used simultaneously for con-
trol and test RNA hybridizations, or a single
membrane is stripped and reprobed. The signal is
detected by using radioactive nucleotides and visu-
alized by phosphorimager analysis or autoradiogra-
phy. Numerous companies now sell such cDNA
membranes and software to analyze the image data
[25-27].

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either 
by spotting prefabricated oligos on a glass support 
[13] or by the more elegant method of direct in situ 
oligo synthesis on the glass surface by photolithog- 
raphy [28-30]. The strength of this approach lies in 
its ability to discriminate DNA molecules based on 
single base-pair difference. This allows the applica-
tion of this method to the fields of medical diagnos-
tics, pharmacogenetics, and sequencing by hybrid-
ization as well as gene-expression analysis.

Fabrication of oligonucleotide chips by photoli-
thography is theoretically simple but technically
complex [29,30]. The light from a high-intensity
mercury lamp is directed through a photolitho-
graphic mask onto the silica surface, resulting in
deprotection of the terminal nucleotides in the illu-
minated regions. The entire chip is then reacted with
the desired free nucleotide, resulting in selected chain
elongation. This process requires only 4n cycles
(where n = oligonucleotide length in bases) to syn-
thesize a vast number of unique oligos, the total num-
ber of which is limited only by the complexity of the
photolithographic mask and the chip size [29].
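
The 4n figure can be made concrete with a short calculation. The sketch below assumes one deprotect-and-couple cycle per base (A, C, G, T) at each of the n positions, which is how the 4n bound arises; the function names are hypothetical, and the oligo lengths are simply examples.

```python
def synthesis_cycles(oligo_length: int) -> int:
    """Upper bound on coupling cycles: one per base (A, C, G, T) per position."""
    return 4 * oligo_length


def possible_sequences(oligo_length: int) -> int:
    """Number of distinct sequences of this length over a 4-letter alphabet."""
    return 4 ** oligo_length


for n in (8, 20, 25):
    print(n, synthesis_cycles(n), possible_sequences(n))
```

For n = 25, at most 100 cycles are needed, while the number of distinct 25-mers is 4^25 (about 1.1 x 10^15). The cycle count grows linearly with length, which is why the mask complexity and the chip area, not the chemistry, limit how many oligos are made.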

Sample preparation involves the generation of 
double-stranded cDNA from cellular poly(A)+ RNA
followed by antisense RNA synthesis in an in vitro
transcription reaction with biotinylated or fluor-
tagged nucleotides. The RNA probe is then frag-
mented to facilitate hybridization. If the indirect
visualization method is used, the chips are incubated
with fluor-linked streptavidin (e.g., phycoerythrin)
after hybridization [12,33]. The signal is detected with
a custom confocal scanner [34]. This method has
been applied successfully to the mapping of genomic
library clones [35], to de novo sequencing by hybrid-
ization [28,36], and to evolutionary sequence com-
parison of the BRCA1 gene [37]. In addition,
mutations in the cystic fibrosis [38] and BRCA1 [39]
gene products and polymorphisms in the human im-
munodeficiency virus-1 clade B protease gene [40]
have been detected by this method. Oligonucleotide
chips are also useful for expression monitoring [33],
as has been demonstrated by the simultaneous evalu-
ation of gene-expression patterns in nearly all open
reading frames of the yeast strain S. cerevisiae [12].
More recently, oligonucleotide chips have been used
to help identify single nucleotide polymorphisms in
the human [41] and yeast [42] genomes.

THE USE OF MICROARRAYS IN TOXICOLOGY 
Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo 
model systems, including the rat, mouse, and rab-
bit, to assess potential toxicity and these bioassays
are the mainstay of toxicology testing. However, in
the past several decades, a plethora of in vitro tech-
niques have been developed to measure toxicity,
many of which measure toxicant-induced DNA dam-
age. Examples of these assays include the Ames test,
the Syrian hamster embryo cell transformation as-
say, micronucleus assays, measurements of sister
chromatid exchange and unscheduled DNA synthe-
sis, and many others. Fundamental to all of these
methods is the fact that toxicity is often preceded
by, and results in, alterations in gene expression. In
many cases, these changes in gene expression are a
far more sensitive, characteristic, and measurable
endpoint than the toxicity itself. We therefore pro-
pose that a method based on measurements of the
genome-wide gene expression pattern of an organ-
ism after toxicant exposure is fundamentally infor-
mative and complements the established methods
described above. 

We are developing a method by which toxicants 
can be identified and their putative mechanisms of 
action determined by using toxicant-induced gene ex- 
pression profiles. In this method, in one or more de- 
fined model systems, dose and time-course parameters 
are established for a series of toxicants within a given
prototypic class (e.g., polycyclic aromatic hydrocar-
bons (PAHs)). Cells are then treated with these agents
at a fixed toxicity level (as measured by cell survival),
RNA is harvested, and toxicant-induced gene expres-
sion changes are assessed by hybridization to a cDNA
microarray chip (Figure 1). We have developed a cus-
tom DNA chip, called ToxChip v1.0, specifically for
this purpose and will discuss it in more detail below.
The changes in gene expression induced by the test
agents in the model systems are analyzed, and the
common set of changes unique to that class of toxi-
cants, termed a toxicant signature, is determined.
This signature is derived by ranking across all ex-
periments the gene-expression data based on rela-
tive fold induction or suppression of genes in treated
samples versus untreated controls and selecting the 
most consistently different signals across the sample 
set. A different signature may be established for each 
prototypic toxicant class. Once the signatures are de- 
termined, gene-expression profiles induced by un- 
known agents in these same model systems can then 
be compared with the established signatures. A match 
assigns a putative mechanism of action to the test 
compound. Figure 2 illustrates this signature method 
for different types of oxidant stressors, PAHs, and
peroxisome proliferators. In this example, the un- 
known compound in question had a gene-expres- 
sion profile similar to that of the oxidant stressors in 
the database. We anticipate that this general method 
will also reveal cross talk between different pathways 
induced by a single agent (e.g., reveal that a com- 
pound has both PAH-like and oxidant-like proper- 
ties). In the future, it may be necessary to distinguish 
very subtle differences between compounds within 
a very large sample set (e.g., thousands of highly simi-
lar structural isomers in a combinatorial chemistry
library or peptide library). To generate these highly 
refined signatures, standard statistical clustering tech- 
niques or principal-component analysis can be used. 
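
The signature idea in this paragraph can be illustrated with a small Python sketch. It is a deliberate simplification and should not be read as the authors' implementation: instead of ranking fold changes across experiments, it keeps genes that every known toxicant in a class shifts in the same direction by at least 2-fold, then scores an unknown agent by the fraction of signature genes it shifts the same way. The function names, the 2-fold threshold, the gene labels, and the fold-change values are all hypothetical.

```python
import math
from statistics import mean


def derive_signature(experiments, min_abs_log2=1.0):
    """Return {gene: mean log2 fold change} for consistently changed genes.

    `experiments` is a list of dicts mapping gene name -> fold change
    (treated / untreated control), one dict per known toxicant of a class.
    A gene is kept only if every experiment moves it in the same direction
    by at least `min_abs_log2` (1.0 means 2-fold). The threshold is an
    assumption of this sketch, not a value given in the text.
    """
    if not experiments:
        return {}
    shared = set.intersection(*(set(e) for e in experiments))
    signature = {}
    for gene in sorted(shared):
        logs = [math.log2(e[gene]) for e in experiments]
        induced = all(v >= min_abs_log2 for v in logs)
        suppressed = all(v <= -min_abs_log2 for v in logs)
        if induced or suppressed:
            signature[gene] = mean(logs)
    return signature


def match_score(profile, signature, min_abs_log2=1.0):
    """Fraction of signature genes the unknown agent shifts the same way."""
    if not signature:
        return 0.0
    hits = 0
    for gene, sig_log in signature.items():
        fold = profile.get(gene)
        if fold is None:
            continue
        log = math.log2(fold)
        if abs(log) >= min_abs_log2 and (log > 0) == (sig_log > 0):
            hits += 1
    return hits / len(signature)


# Hypothetical fold changes for two known oxidant stressors.
oxidant_runs = [
    {"geneA": 4.0, "geneB": 0.25, "geneC": 1.1},
    {"geneA": 8.0, "geneB": 0.5, "geneC": 3.0},
]
oxidant_signature = derive_signature(oxidant_runs)
unknown = {"geneA": 5.0, "geneB": 0.4, "geneC": 1.0}
print(oxidant_signature)                        # {'geneA': 2.5, 'geneB': -1.5}
print(match_score(unknown, oxidant_signature))  # 1.0
```

In this toy data set geneC is excluded from the signature because only one of the two stressors changes it by 2-fold or more. For the very large, highly similar sample sets mentioned above, the clustering or principal-component methods named in the text would replace this simple threshold rule.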

For the studies outlined in Figure 2, we developed
the custom cDNA microarray chip ToxChip v1.0.

[Figure 1 diagram: control population and treated population; RNA isolation;
reverse transcription (Cy3 label legible); DNA "chip"; mix cDNAs and apply to
array; hybridize under coverslip.]

Figure 1. Simplified overview of the method for sample
preparation and hybridization to cDNA microarrays. For illus-
trative purposes, samples derived from cell culture are depicted,
although other sample types are amenable to this analysis.

Figure 2. Schematic representation of the method for iden-
tification of a toxicant's mechanism of action. In this method,
gene-expression data derived from exposure of model sys-
tems to known toxicants [...] a set of changes
characteristic to that type of toxicant (termed a toxicant
signature) is identified. [...] mechanism of action is assigned to
the unknown agent.

The 2090 human genes that comprise this subarray
were selected for their well-documented involve-
ment in basic cellular processes as well as their re-
sponses to different types of toxic insult. Included
on this list are DNA replication and repair genes,
apoptosis genes, and genes responsive to PAHs and
dioxin-like compounds, peroxisome proliferators,
estrogenic compounds, and oxidant stress. Some of
the other categories of genes include transcription
factors, oncogenes, tumor suppressor genes, cyclins,
kinases, phosphatases, cell adhesion and motility
genes, and homeobox genes. Also included in this
group are 84 housekeeping genes, whose hybridiza-
tion intensity is averaged and used for signal nor-
malization of the other genes on the chip. To date,
very few toxicants have been shown to have appre-
ciable effects on the expression of these housekeep-
ing genes. However, this housekeeping list will be
revised if new data warrant the addition or deletion
of a particular gene. Table 1 contains a general de-
scription of some of the different classes of genes
that comprise ToxChip v1.0.
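
The housekeeping normalization described here (average the housekeeping intensities, then scale every other gene by that average) can be sketched as follows. This is an illustrative reading of the sentence, not ToxChip's actual software: the function name, the gene labels, and the intensity values are hypothetical, and a real chip would average 84 housekeeping genes rather than two.

```python
from statistics import mean


def normalize_to_housekeeping(intensities, housekeeping):
    """Divide every intensity by the mean intensity of the housekeeping genes.

    `intensities` maps gene name -> hybridization intensity for one channel;
    `housekeeping` lists the genes used as the normalization reference.
    """
    reference = [intensities[g] for g in housekeeping if g in intensities]
    if not reference:
        raise ValueError("no housekeeping genes found on this array")
    scale = mean(reference)
    if scale <= 0:
        raise ValueError("housekeeping mean must be positive")
    return {gene: value / scale for gene, value in intensities.items()}


# Hypothetical intensities; the housekeeping mean is (900 + 1100) / 2 = 1000.
raw = {"hk1": 900.0, "hk2": 1100.0, "target1": 4000.0}
normalized = normalize_to_housekeeping(raw, housekeeping=["hk1", "hk2"])
print(normalized["target1"])  # 4.0
```

Averaging over many housekeeping genes makes the scale factor less sensitive to any single reference gene that happens to respond to the toxicant, which matches the text's plan to revise the list when new data warrant it.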

Once a toxicant signature is determined, the
genes within this signature are flagged within the
[...]
[...] the data can be quickly reformatted so that
blocks of genes representing the different signatures
are displayed [...]. This facilitates rapid, visual in-
terpretation of data. We are also developing Tox-
Chip v2.0 and chips for other model systems,
including rat, mouse, Xenopus, and yeast, for use in
toxicology studies.

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the
use of animals as model systems for toxicology test-
ing. Unfortunately, these assays are inherently ex-
pensive, require large numbers of animals and take a
long time to complete and analyze. Therefore, the
National Institute of Environmental Health Sciences,
the National Toxicology Program, and the
[...] community at large are committed to re-
ducing the number of animals used, by developing
[...] and alternative testing methodologies.
Although substantial progress has been made in the
development of alternative methods, bioassays are
still used for testing endpoints such as neurotoxic-
ity, [...], reproductive and developmen-
tal toxicology, and genetic toxicology. The rodent
cancer bioassay is a particularly expensive and time-
consuming assay. It requires almost 4 y, 1200
animals, and millions of dollars to execute and ana-
lyze [...]. [...] experiments of the type outlined
in Figure 2 might provide evidence that an unknown
agent is (or is not) responsible for eliciting a given
biological response. This information would help to
select a bioassay more specifically suited to the agent
in question or perhaps suggest that a bioassay is not
necessary, which would dramatically reduce cost,
animal use, and time.

Table 1. ToxChip v1.0: A Human cDNA Microarray
Chip Designed to Detect Responses to Toxic Insult

Gene category                           No. of genes on chip
Apoptosis                               72
DNA replication and repair              99
Oxidative stress/redox homeostasis      90
Peroxisome proliferator responsive      22
Dioxin/PAH responsive                   12
Estrogen responsive                     63
Housekeeping                            84
Oncogenes and tumor suppressor genes    76
Cell-cycle control                      51
Transcription factors                   131
Kinases                                 276
Phosphatases                            88
Heat-shock proteins                     23
Receptors                               349
Cytochrome P450s                        30

*This list is intended as a general guide. The gene categories are not
unique, and some genes are listed in multiple categories.

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassay and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-spe- 
cific toxicants, and new compounds could be 
screened for these characteristic signatures, provid-
ing a rapid and sensitive in vivo test. Also, because
gene expression is often exquisitely sensitive to low 
doses of a toxicant, the combination of gene-expres- 
sion screening and the bioassay might allow the use 
of lower toxicant doses, which are more relevant to 
human exposure levels, and the use of fewer ani-
mals. In addition, gene-expression changes are nor-
mally measured in hours or days, not in the months
to years required for tumor development. Further-
more, microarrays might be particularly useful for
investigating the relationship between acute and
chronic toxicity and identifying secondary effects
of a given toxicant by studying the relationship
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 

These considerations are also relevant for branches 
of toxicology not related to human health and not
using rodents as model systems, such as aquatic toxi-
cology and plant pathology. Bioassays based on the
fathead minnow, Daphnia, and Arabidopsis could
also be improved by the addition of microarray analy-
sis. The combination of microarrays with traditional
bioassays might also be useful for investigating some 
of the more intractable problems in toxicology re- 
search, such as the effects of complex mixtures and 
the difficulties in cross-species extrapolation. 

Exposure Assessment, Environmental Monitoring 
and Drug Safety 

The currently used methods for assessment of ex- 
posure to chemical toxicants are based on measure- 
ment of tissue toxin levels or on surrogate markers 
of toxicity, termed biomarkers (e.g., peripheral blood 
levels of hepatic enzymes or DNA adducts). Because 
gene expression is a sensitive endpoint, gene expres-
sion as measured with microarray technology may
be useful as a new biomarker to more precisely iden- 
tify hazards and to assess exposure. Similarly, 
microarrays could be used in an environmental- 
monitoring capacity to measure the effect of poten- 
tial contaminants on the gene-expression profiles 
of resident organisms. In an analogous fashion, 
microarrays could be used to measure gene-expres-
sion endpoints in subjects in clinical trials. The com-
bination of these gene-expression data and more
established toxic endpoints in these trials could be 
used to define highly precise surrogates of safety. 

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this infor- 
mation, the nature of the toxic exposure can be de- 
termined or a relative clinical safety factor estimated. 
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood 
lymphocytes of Polish coke-oven workers exposed
to PAHs (and many other compounds) is under con-
sideration at the NIEHS. An important consideration
for these types of studies is that gene expression can 
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of 
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene-
expression database are currently under way [44,45].
However, this national database approach will re-
quire a better understanding of genome-wide gene
expression across the highly diverse human popu-
lation and of the effects of environmental factors
on this expression.

[...] which data will be collected by different laboratories
will make large-scale data analysis difficult. To help circumvent
these future [...] facilitate data entry into the national database [...]

Abstract

Recent progress in genomics and proteomics technologies has created a unique opportunity to significantly impact
the pharmaceutical drug development processes. The perception that cells and whole organisms express specific
inducible responses to stimuli such as drug treatment implies that unique expression patterns, molecular fingerprints,
indicative of a drug's efficacy and potential toxicity are accessible. The integration into state-of-the-art toxicology of
assays allowing one to profile treatment-related changes in gene expression patterns promises new insights into
mechanisms of drug action and toxicity. The benefits will be improved lead selection, and optimized monitoring of
drug efficacy and safety in pre-clinical and clinical studies based on biologically relevant tissue and surrogate markers.
© 2000 Elsevier Science Ireland Ltd. All rights reserved.

1. Introduction

The majority of drugs act by binding to protein 
targets, most to known proteins representing en- 
zymes, receptors and channels, resulting in effects 
such as enzyme inhibition and impairment of 
signal transduction. The treatment-induced per- 
turbations provoke feedback reactions aiming to 
compensate for the stimulus, which almost always 
are associated with signals to the nucleus, result-
ing in altered gene expression. Such gene expres-
sion regulations account for both the
pharmacological action and the toxicity of a drug
and can be visualized by either global mRNA or 
global protein expression profiling. Hence, for 
each individual drug, a characteristic gene regula- 
tion pattern, its molecular fingerprint, exists 
which bears valuable information on its mode of 
action and its mechanism of toxicity. 

Gene expression is a multistep process that 
results in an active protein (Fig. 1). There exist 
numerous regulation systems that exert control at 
and after the transcription and the translation 
step. Genomics, by definition, encompasses the 
quantitative analysis of transcripts at the mRNA 
level, while the aim of proteomics is to quantify 
gene expression further down-stream, creating a 
snapshot of gene regulation closer to ultimate cell 
function control. 
2. Global mRNA profiling 

Expression data at the mRNA level can be 
produced using a set of different technologies 
such as DNA microarrays, reverse transcript 
imaging, amplified fragment length polymorphism 
(AFLP), serial analysis of gene expression 
(SAGE) and others. Currently, DNA microarrays 
are very popular and show great potential. 
On a typical array, each gene of interest is represented 
either by a long DNA fragment (200-2400 
bp), typically generated by polymerase chain reaction 
(PCR) and spotted on a suitable substrate 
using robotics (Schena et al., 1995; Shalon et al., 
1996), or by several short oligonucleotides (20-30 
bp) synthesized directly onto a solid support using 
photolabile nucleotide chemistry (Fodor et al., 
1991; Chee et al., 1996). From control and treated 
tissues, total RNA or mRNA is isolated and 
reverse transcribed in the presence of radioactive 
or fluorescent labeled nucleotides, and the labeled 
probes are then hybridized to the arrays. The 
intensity of the array signal is measured for each 
gene transcript by either autoradiography or laser 
scanning confocal microscopy. The ratio between 
the signals of control and treated samples reflects 
the relative drug-induced change in transcript 
abundance. 
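
The ratio described above can be illustrated with a minimal sketch. It assumes 
background-corrected signal intensities are already available per gene; the gene 
names, the intensity values, and the `floor` guard against very small signals are 
hypothetical and are not taken from the article. 

```python
import math

def log2_ratios(control, treated, floor=1.0):
    """Return log2(treated / control) for every gene present in `control`.

    `floor` is an assumed lower bound that keeps near-zero signals from
    producing extreme or undefined ratios.
    """
    ratios = {}
    for gene, control_signal in control.items():
        treated_signal = treated[gene]
        ratios[gene] = math.log2(
            max(treated_signal, floor) / max(control_signal, floor)
        )
    return ratios

# Hypothetical array signals (arbitrary intensity units).
control = {"geneA": 200.0, "geneB": 800.0, "geneC": 50.0}
treated = {"geneA": 800.0, "geneB": 400.0, "geneC": 50.0}

ratios = log2_ratios(control, treated)
for gene, value in ratios.items():
    print(f"{gene}: log2 ratio = {value:+.2f}")
```

A positive value indicates a transcript that is more abundant after treatment, a 
negative value one that is less abundant, and zero indicates no change. 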



3. Global protein profiling 

Global quantitative expression analysis at the 
protein level is currently restricted to the use of 
two-dimensional gel electrophoresis. This tech- 
nique combines separation of tissue proteins by 
isoelectric focusing in the first dimension and by 
sodium dodecyl sulfate slab gel electrophoresis- 
based molecular weight separation on the second, 
orthogonal dimension (Anderson et al., 1991). 
The product is a rectangular pattern of protein 
spots that are typically revealed by Coomassie 
Blue, silver or fluorescent staining (Fig. 2). 
Protein spots are identified by mass spectrometry 
following generation of peptide mass fingerprints 
(Mann et al., 1993) and sequence tags (Wilkins et 
al., 1996). Similar to the mRNA approach, the 
optical densities of spots from 
control and treated samples are compared to 
search for treatment-related changes. 

4. Expression data analysis 

Bioinformatics forms a key element required to 
organize, analyze and store expression data from 
either source, the mRNA or the protein level. The 
overall objective, once a mass of high-quality 
Fig. 1. Production of an active protein is a multistep process in which numerous regulation systems exert control at various stages 
of expression. Molecular fingerprints of drugs can be visualized through expression profiling at the mRNA level (genomics) using 
a variety of technologies and at the protein level (proteomics) using two-dimensional gel electrophoresis. 




liver homogenate. 



5. Comparison of global mRNA and protein 
expression profiling 

There are several synergies and overlaps of data 
obtained by mRNA and protein expression analysis. 
Low-abundance transcripts may not be easily 
quantified at the protein level using standard 
two-dimensional gel electrophoresis analysis and their 
detection may require prefractionation of samples. 
The expression of such genes may be preferably 
quantified at the mRNA level using 
techniques allowing PCR-mediated target amplifi- 
quantitative expression data has been collected, is 
to visualize complex patterns of gene expression 
changes, to detect pathways and sets of genes 
tightly correlated with treatment efficacy and toxicity, 
and to compare the effects of different sets of 
treatment (Anderson et al., 1996). As the drug 
effect database is growing, one may detect similarities 
and differences between the molecular fingerprints 
produced by various drugs, information 
that may be crucial to make a decision whether to 
refocus or extend the therapeutic spectrum of a 
drug candidate. 
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cation. Tissue biopsy samples typically yield good 
quality of both mRNA and proteins; however, the 
quality of mRNA isolated from body fluids is 
often poor due to the faster degradation of 
mRNA when compared with proteins. RNA samples 
from body fluids such as serum or urine are 
often not very 'meaningful', and secreted proteins 
are likely more reliable surrogate markers for 
treatment efficacy and safety. Detection of 
post-translational modifications, events often related to 
function or nonfunction of a protein, is restricted 
to protein expression analysis and rarely can be 
predicted by mRNA profiling. Information on 
subcellular localization and translocation of 
proteins has to be acquired at the level of the 
protein in combination with sample prefractionation 
procedures. The growing evidence of a poor 
correlation between mRNA and protein abundance 
(Anderson and Seilhamer, 1997) further 
suggests that the two approaches, mRNA and 
protein profiling, are complementary and should 
be applied in parallel. 
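
As an illustration of how the agreement between the two data types might be 
checked, the sketch below computes a Pearson correlation coefficient between 
paired mRNA and protein abundance measurements. The choice of Pearson's 
coefficient and all values are assumptions made for illustration; the article 
does not state which statistic was used in the cited work. 

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical paired abundances for the same five gene products.
mrna_abundance = [1.0, 2.0, 3.0, 4.0, 5.0]
protein_abundance = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson(mrna_abundance, protein_abundance)
print(f"mRNA/protein correlation: r = {r:.2f}")
```

A coefficient well below 1 for a set of gene products would be one sign that 
transcript levels alone are a weak proxy for protein levels in that system. 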

6. Expression profiling and drug development 

Understanding the mechanisms of action and 
toxicity, and being able to monitor treatment 
efficacy and safety during trials are crucial for the 
successful development of a drug. Mechanistic 
insights are essential for the interpretation of drug 
effects and enhance the chances of recognizing 
potential species specificities contributing to an 
improved risk profile in humans (Richardson et 
al., 199?; Steiner et al., 1996b; Aicher et al., 1998). 
The value of expression profiling further increases 
when links between treatment-induced expression 
profiles and specific pharmacological and toxic 
endpoints are established (Anderson et al., 1991, 
1995, 1996; Steiner et al., 1996a). Changes in gene 
expression are known to precede the manifesta- 
tion of morphological alterations, giving expres- 
sion profiling a great potential for early 
compound screening, enabling one to select drug 
candidates with wide therapeutic windows 
reflected by molecular fingerprints indicative of 
high pharmacological potency and low toxicity 
(Arce et al., 1998). In later phases of drug development, 
surrogate markers of treatment efficacy 
and toxicity can be applied to optimize the monitoring 
of pre-clinical and clinical studies (Doherty 
et al., 1998). 



7. Perspectives 

The basic methodology of safety evaluation has 
changed little during the past decades. Toxicity in 
laboratory animals has been evaluated primarily 
by using hematological, clinical chemistry and 
histological parameters as indicators of organ 
damage. The rapid progress in genomics and pro- 
teomics technologies creates a unique opportunity 
to dramatically improve the predictive power of 
safety assessment and to accelerate the drug devel- 
opment process. Application of gene and protein 
expression profiling promises to improve lead selection, 
resulting in the development of drug candidates 
with higher efficacy and lower toxicity. 
The identification of biologically relevant surrogate 
markers correlated with treatment efficacy 
and safety bears a great potential to optimize the 
monitoring of pre-clinical and clinical trials. 
Application of DNA Arrays to Toxicology 
DNA array technology makes it possible to rapidly genotype individuals or quantify the expression 
of thousands of genes on a single filter or glass slide and holds enormous potential in toxicologic 
applications. This potential led to a U.S. Environmental Protection Agency-sponsored workshop 
titled "Application of Microarrays to Toxicology" on 7-8 January 1999 in Research Triangle Park, 
North Carolina. In addition to providing state-of-the-art information on the application of DNA or 
gene microarrays, the workshop catalyzed the formation of several collaborations, committees, and 
user's groups throughout the Research Triangle Park area and beyond. Potential applications of 
microarrays to toxicologic research and risk assessment include genome-wide expression analyses to 
identify gene-expression networks and toxicant-specific signatures that can be used to define mode 
of action, for exposure assessment, and for environmental monitoring. Arrays may also prove useful 
for monitoring genetic variability and its relationship to toxicant susceptibility in human populations. 
Key words: DNA arrays, gene arrays, microarrays, toxicology. Environ Health Perspect 
107:681-685 (1999). [Online 6 July 1999] 



Decoding the genetic blueprint is a dream that 
offers manifold returns in terms of understanding 
how organisms develop and function in an 
often hostile environment. With the rapid 
advances in molecular biology over the last 30 
years, the dream has come a step closer to reality. 
Molecular biologists now have the ability to 
elucidate the composition of any genome. 
Indeed, almost 20 genomes have already been 
sequenced and more than 60 are currently 
under way. Foremost among these is the 
Human Genome Mapping Project. However, 
the genomes of a number of commonly used 
laboratory species are also under intensive 
investigation, including yeast, Arabidopsis, 
maize, rice, zebra fish, mouse, rat, and dog. It 
is widely expected that the completion of such 
programs will facilitate the development of 
many powerful new techniques and approaches 
to diagnosing and treating genetically and 
environmentally induced diseases which afflict 
mankind. However, the vast amount of data 
being generated by genome mapping will 
require new high-throughput technologies to 
investigate the function of the millions of new 
genes that are being reported. Among the most 
widely heralded of the new functional 
genomics technologies are DNA arrays, which 
represent perhaps the most anticipated new 
molecular biology technique since polymerase 
chain reaction (PCR). 

Arrays enable the study of literally thousands 
of genes in a single experiment. The 
potential importance of arrays is enormous and 
has been highlighted by the recent publication 
of an entire Nature Genetics supplement dedicated 
to the technology (1). Despite this huge 
surge of interest, DNA arrays are still little used 
and largely unproven, as demonstrated by the 
high ratio of review and press articles to actual 
data papers. Even so, the potential they offer 
has driven venture capitalists into a frenzy of 
investment and many new companies are 
springing up to claim a share of this rapidly 
developing market. 

The U.S. Environmental Protection 
Agency (EPA) is interested in applying DNA 
array technology to ongoing toxicologic studies. 
To learn more about the current state of 
the technology, the Reproductive Toxicology 
Division (RTD) of the National Health and 
Environmental Effects Research Laboratory 
(NHEERL; Research Triangle Park, NC) 
hosted a workshop on "Application of 
Microarrays to Toxicology" on 7-8 January 
1999 in Research Triangle Park, North 
Carolina. The workshop was organized by 
David Dix, Robert Kavlock, and John Rockett 
of the RTD/NHEERL. Twenty-two intramural 
and extramural scientists from government, 
academia, and industry shared information, 
data, and opinions on the current and 
future applications for this exciting new technology. 
The workshop had more than 150 
attendees, including researchers, students, and 
administrators from the EPA, the National 
Institute of Environmental Health Sciences 
(NIEHS), and a number of other establishments 
from Research Triangle Park and 
beyond. Presentations ranged from the technology 
behind array production through the 
sharing of actual experimental data and projections 
on the future importance and applications 
of arrays. The information contained in 
the workshop presentations should provide aid 
and insight into arrays in general and their 
application to toxicology in particular. 

Array Elements 

In the context of molecular biology, the word 
"array" is normally used to refer to a series of 
DNA or protein elements firmly attached in 
a regular pattern to some kind of supportive 
medium. DNA array is often used interchangeably 
with gene array or microarray. 
Although not formally defined, microarray is 
generally used to describe the higher density 
arrays typically printed on glass chips. The 
DNA elements that make up DNA arrays 
can be oligonucleotides, partial gene 
sequences, or full-length cDNAs. Companies 
offering pre-made arrays that contain less 
than full-length clones normally use regions 
of the genes which are specific to that gene to 
prevent false positives arising through 
cross-hybridization. Sequence verification of 
cDNA clone identity is necessary because of 
errors in identifying specific clones from 
cDNA libraries and databases. Premade 
DNA arrays printed on membranes are currently 
or imminently available for human, 
mouse, and rat. In most cases they contain 
DNA sequences representing several thousand 
different sequence clusters or genes as 
delineated through the National Center for 
Biotechnology Information UniGene Project 
(2). Many of these different UniGene clusters 
(putative genes) are represented only by 
expressed sequence tags (ESTs). 

Array Printing 

Arrays are typically printed on one of two 
types of support matrix. Nylon membranes 
are used by most off-the-shelf array providers 
such as Clontech Laboratories, Inc. 
(Palo Alto, CA), Genome Systems, Inc. (St. 
Louis, MO), and Research Genetics, Inc. 
(Huntsville, AL). Microarrays such as those 
produced by Affymetrix, Inc. (Santa Clara, 
CA), Incyte Pharmaceuticals, Inc. (Palo Alto, 
CA), and many do-it-yourself (DIY) arraying 
groups use glass wafers or slides. Although 
standard microscope slides may be used, they 
must be pretreated to facilitate adhesion 
of the DNA to the glass. Several different 
coatings have been successfully used, including 
silane and lysine. The coating of slides 
can easily be carried out in the laboratory, 
but many prefer the convenience of precoated 
slides available from suppliers. 

Once the support matrix has been prepared, 
the DNA elements can be applied by 
several methods. Affymetrix, Inc., has developed 
a unique photolithographic technology 
for attaching oligonucleotides to glass wafers. 
More commonly, DNA is applied by either 
noncontact or contact printing. Noncontact 
printers can use thermal, solenoid, or piezoelectric 
technology to spray aliquots of solution 
onto the support matrix and may be used to 
produce slide or membrane-based arrays. 
Cartesian Technologies, Inc. (Irvine, CA) has 
developed nQUAD technology for use in its 
PixSys printers. The system couples a syringe 
pump with the microsolenoid valve, a combination 
that provides rapid quantitative dispensing 
of nanoliter volumes (down to 4.2 nL) over 
a variable volume range. A different approach 
to noncontact printing uses a solid pin and ring 
combination (Genetic MicroSystems, Inc., 
Woburn, MA). This system (Figure 1) allows a 
broader range of sample, including cell suspensions 
and particulates, because the printing 
head cannot be blocked up in the same way as 
a spray nozzle. Fluid transfer is controlled in 
this system primarily by the pin dimensions 
and the force of deposition, although the 
nature of the support matrix and the sample 
will also affect transfer to some degree. 

In contact printing, the pin head is dipped 
in the sample and then touched to the support 
matrix to deposit a small aliquot. Split pins 
were one of the first contact-printing devices 
to be reported and are the suggested format 
for DIY arrayers, as described by Brown (3). 
Split pins are small metal pins with a precise 
groove cut vertically in the middle of the pin 
end. In this system, 1-48 split pins are positioned 
in the pin head. The split pins work by 
simple capillary action, not unlike a fountain 
pen: when the pin heads are dipped in the 
sample, liquid is drawn into the pin groove. A 
small (fixed) volume is then deposited each 
time the split pins are gently touched to 
the support matrix. Sample (100-500 pL 
depending on a variety of parameters) can be 
deposited on multiple slides before refilling is 
required, and array densities of > 2,500 
spots/cm² may be produced. The deposit volume 
depends on the split size, sample fluidity, 
and the speed of printing. Split pins are 
relatively simple to produce and can be made 
in-house if a suitable machine shop is available. 
Alternatively, they can be obtained 
directly from companies such as TeleChem 
International, Inc. (Sunnyvale, CA). 

Irrespective of their source, printers 
should be run through a preprint sequence 
prior to producing the actual experimental 
arrays; the first 100 or so spots of a new run 
tend to be somewhat variable. Factors affecting 
spot reproducibility include slide treatment 
homogeneity, sample differences, and 
instrument errors. Other factors that come 
into play include clean ejection of the drop 
and clogging in nQUAD printing, and 
mechanical variations and long-term alteration 
in print-head surface of solid and split 
pins. However, with careful preparation it is 
possible to get a coefficient of variance for 
spot reproducibility below 10%. 
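
The reproducibility figure quoted above can be checked with a short sketch. It 
assumes that the quantity meant is the coefficient of variation (sample standard 
deviation divided by the mean, expressed as a percentage) of the signal from 
replicate spots of the same sample; the intensity values and the function name 
are hypothetical. 

```python
import statistics

def coefficient_of_variation(intensities):
    """Sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(intensities)
    return statistics.stdev(intensities) / mean * 100.0

# Hypothetical intensities of five replicate spots of one sample,
# taken after the initial preprint spots have been discarded.
replicate_spots = [980.0, 1010.0, 1005.0, 995.0, 1010.0]

cv = coefficient_of_variation(replicate_spots)
passes = cv < 10.0  # the 10% threshold comes from the text above
print(f"CV = {cv:.2f}% -> {'acceptable' if passes else 'too variable'}")
```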

One potential printing problem is sample 
carryover. Repeated washing, blotting, and 
drying (vacuum) of print pins between samples 
is normally effective at reducing sample carry- 
over to negligible amounts. Printing should 
also be carried out in a controlled environ- 
ment. Humidified chambers are available in 
which to place printers. These help prevent 
dust contamination and produce a uniform 
drying rate, which is important in determining 
spot size, quality, and reproducibility. 

In summary, although several printing 
technologies are available, none are par- 
ticularly outstanding and the bottom line 
is that they are still in a relatively early stage 
of evolution. 

Array Hybridization 

The hybridization protocol is, practically 
speaking, relatively straightforward and those 
with previous experience in blotting should 
have little difficulty. Array hybridizations 
are, in essence, reverse Southern/Northern 
blots: instead of applying a labeled probe to 
the target population of DNA/RNA, the 
labeled population is applied to the probe(s). 
With membrane-based arrays, the control and 
treated mRNA populations are normally converted 
to cDNA and labeled with isotope (e.g., 
a phosphorus radioisotope) in the process. These 
labeled populations are then hybridized 
independently to parallel or serial arrays and the 
hybridization signal is detected with a 
phosphorimager. A less commonly used 
alternative to radioactive probes is 
enzymatic detection. The probe may be 
biotinylated, haptenylated, or have alkaline 
phosphatase/horseradish peroxidase attached. 
Hybridization is detected by enzymatic reaction 
yielding a color reaction (4). Differences 
in hybridization signals can be detected by eye 
or, more accurately, with the help of digital 
imaging and commercially available software. 
The labeling of the test populations for 
slide-based microarrays uses a slightly different 
approach. The probe typically consists of two 
samples of polyA+ RNA (usually from a treated 
and a control population) that are converted to 
cDNA; in the process each is labeled with a 
different fluor. The independently labeled 
probes are then mixed together and hybridized 
to a single microarray slide and the resulting 
combined fluorescent signal is scanned. After 
Figure 1. Genetic MicroSystems (Woburn, MA) pin 
ring system for printing arrays. The pin ring combination 
consists of a circular open ring oriented 
parallel to the sample solution, with a vertical pin 
centered over the ring. When the ring is dipped 
into a solution and lifted, it withdraws an aliquot 
of sample held by surface tension. To spot the 
sample, the pin is driven down through the ring 
and a portion of the solution is transferred to the 
bottom of the pin. The pin continues to move 
downward until the pendant drop of solution 
makes contact with the underlying surface. The 
pin is then lifted, and gravity and surface tension 
cause deposition of the spot onto the array. 
Figure from Flowers et al. (14), with permission 
from Genetic MicroSystems. 

normalization, it is possible to determine the 
ratio of fluorescent signals from a single 
hybridization of a slide-based microarray. 
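
The normalization step mentioned above is not specified in the text. The sketch 
below shows one simple possibility, global median scaling, which assumes that 
most genes on the slide are unchanged so that the median treated/control ratio 
should equal 1. The assignment of Cy3 to the control sample and Cy5 to the 
treated sample, and all intensity values, are hypothetical. 

```python
import statistics

def normalized_ratios(control_channel, treated_channel):
    """Treated/control ratios rescaled so that their median equals 1."""
    raw = [t / c for c, t in zip(control_channel, treated_channel)]
    scale = statistics.median(raw)
    return [ratio / scale for ratio in raw]

# Hypothetical spot intensities from one slide; the treated channel
# reads about twice as bright overall because of unequal labeling.
cy3_control = [100.0, 200.0, 400.0, 800.0, 1000.0]
cy5_treated = [200.0, 400.0, 800.0, 6400.0, 1000.0]

ratios = normalized_ratios(cy3_control, cy5_treated)
print(ratios)
```

After scaling, the first three spots read as unchanged, the fourth as induced, 
and the fifth as repressed, even though the raw ratios of the first three 
spots were all 2. 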

cDNA derived from control and treated 
populations of RNA is most commonly 
hybridized to arrays, although subtractive 
hybridization or differential display reactions 
may also be used. Fluorophore- or radiolabeled 
nucleotides are directly incorporated 
into the cDNA in the process of converting 
RNA to cDNA. Alternatively, 5' end-labeled 
primers may be used for cDNA synthesis. 
These are labeled with a fluorophore for 
direct visualization of the hybridized array. 
Alternatively, biotin or a hapten may be 
attached to the primer, in which case 
fluor-labeled streptavidin or antibody must be 
applied before a signal can be generated. The 
most commonly used fluorophores at present 
are cyanine (Cy)3 and Cy5 (Amersham 
Pharmacia Biotech AB, Uppsala, Sweden). 
However, the relative expense of these fluorescent 
conjugates has driven a search for 
cheaper alternatives. Fluorescein, rhodamine, 
and Texas red have all been used, and 
companies such as Molecular Probes, Inc. 
(Eugene, OR) are developing a series of 
labeled nucleotides with a wide range of excitation 
and emission spectra which may prove 
to function as well as the Cy dyes. 
Analysis of DNA Microarrays 

Membrane-based arrays are normally analyzed 
on film or with a phosphorimager, whereas 
chip-based arrays require more specialized 
scanning devices. These can be divided into three 
main groups: the charge-coupled device camera 
systems, the nonconfocal laser scanners, and the 
confocal laser scanners. The advantages and 
disadvantages of each system are listed in Table 1. 

Because a typical spot on a microarray can 
contain > 10^6 molecules, it is clear that a large 
variation in signal strength may occur. 
Current scanners cannot work across this 
many orders of magnitude (4 or 5 is more typical). 
However, the scanning parameters can 
normally be adjusted to collect more or less 
signal, such that two or three scans of the same 
array should permit the detection of rare and 
abundant genes. 
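
One way in which two scans of the same array might be combined is sketched 
below. It is an assumption-laden illustration, not a procedure given in the 
text: the high-gain reading is used wherever the detector is not saturated, and 
saturated spots are replaced by the low-gain reading rescaled by the median 
high/low ratio of the unsaturated spots. The 16-bit saturation limit and all 
intensities are hypothetical. 

```python
import statistics

SATURATION = 65535  # assumed ceiling of a 16-bit detector

def merge_scans(low_gain, high_gain, saturation=SATURATION):
    """Combine a low-gain and a high-gain scan of the same spots."""
    # Estimate the gain factor from spots that are usable in both scans.
    factors = [
        high / low
        for low, high in zip(low_gain, high_gain)
        if high < saturation and low > 0
    ]
    factor = statistics.median(factors)
    # Keep high-gain readings unless saturated; otherwise rescale low-gain.
    return [
        high if high < saturation else low * factor
        for low, high in zip(low_gain, high_gain)
    ]

# Hypothetical readings: the last spot saturates the high-gain scan.
low_gain_scan = [10, 100, 1000, 20000]
high_gain_scan = [100, 1000, 10000, 65535]

merged = merge_scans(low_gain_scan, high_gain_scan)
print(merged)
```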

When a microarray is scanned, the fluorescent 
images are captured by software normally 
included with the scanner. Several commercial 
suppliers provide additional software for 
quantifying array images, but the software tools are 
constantly evolving to meet the developing 
needs of researchers, and it is prudent to 
define one's own needs and clarify the exact 
capabilities of the software before its purchase. 
Issues that should be considered include the 
following: 

• Can the software locate offset spots? 

• Can it quantitate across irregular hybridization 
signals? 

• Can the arrayed genes be programmed in for 
easy identification and location? 

• Can the software connect via the Internet to 
databases containing further information on 
the gene(s) of interest? 

One of the key issues raised at the workshop 
was the sensitivity of microarray technology. 
Experiments by General Scanning, Inc. 
(Watertown, MA), have shown that by using 
the Cy dyes and their scanner, signal can be 
detected down to levels of < 1 fluor molecule 
per square micrometer, which translates to 
detecting a rare message at approximately one 
copy per cell or less. 

Array Applications 

Although arrays are an emerging technology 
certain to undergo improvement and 
alteration, they have already been applied usefully 
to a number of model systems. Arrays are 
at their most powerful when they contain the 
entire genome of the species they are being 
used to study. For this reason, they have strong 
support among researchers utilizing yeast and 
Caenorhabditis elegans (5). The genomes of 
both of these species have been sequenced and, 
in the case of yeast, deposited onto arrays for 
examination of gene expression (6,7). With 
both of these species, it is relatively easy to 
perturb individual gene expression. Indeed, C. 
elegans knockouts can be made simply by 
soaking the worms in an antisense solution of 
the gene to be knocked out. 

Table 1 footnote: CCD, charge-coupled device. 
From Kawasaki (13). 

By a process of systematic gene disruption, 
it is now possible to examine the cause 
and effect relationships between different 
genes in these simple organisms. This kind of 
approach should help elucidate biochemical 
pathways and genetic control processes, 
deconvolute polygenic interactions, and 
define the architecture of the cellular network. 
A simple case study of how this can be 
achieved was presented by Butow [University 
of Texas Southwestern Medical Center, 
Dallas, TX (Figure 2)]. Although it is the 
phenotypic result of a single gene knockout 
that is being examined, the effect of such 
perturbation will almost always be polygenic. 
Polygenic interactions will become increasingly 
important as researchers begin to move 
away from single gene systems when examining 
the nature of toxicologic responses to 
external stimuli. This is especially important 
in toxicology because the phenotype produced 
by a given environmental insult is 
never the result of the action of a single gene; 
rather, it is a complex interaction of one or 
multiple cellular pathways. Phenomena such 
as quantitative trait (the continuous variation 
of phenotype), epistasis (the effect of alleles of 
one or more genes on the expression of other 
genes), and penetrance (proportion of individuals 
of a given genotype that display a particular 
phenotype) will become increasingly 
evident and important as toxicologists push 
toward the ultimate goal of matching the 
responses of individuals to different 
environmental stimuli. 

Analysis of the transcriptome (the expression 
level of all the genes in a given cell population) 
was a use of arrays addressed by several 
speakers. Unfortunately, current gene nomenclature 
is often confusing in that single genes 
are allocated multiple names (usually as a result 
of independent discovery by different laboratories), 
and there was a call for standardization of 
gene nomenclature. Nevertheless, once a transcriptome 
has been assembled it can then be 
transferred onto arrays and used to screen any 
chosen system. The EPA MicroArray 
Consortium (EPAMAC) is assembling testes 
transcriptomes for human, rat, and mouse. In a 
slightly different approach, Nuwaysir et al. (8) 
describe how the NIEHS assembled what is 
effectively a "toxicological transcriptome": a 
library of human and mouse genes that have 
previously been proven or implicated in 
responses to toxicologic insults. Clontech 
Laboratories, Inc. (Palo Alto, CA), has begun a 
similar process by developing stress/toxicology 
filter arrays of rat, mouse, and human genes. 
Thus, rather than being tissue or cell specific, 
these stress/toxicology arrays can be used across 
a variety of model systems to look for alterations 
in the expression of toxicologically 
important genes and define the new field of 
toxicogenomics. The potential to identify toxicant 
families based on tissue- or cell-specific 
gene expression could revolutionize drug testing. 
These molecular signatures or fingerprints 
could not only point to the possible 
toxicity/carcinogenicity of newly discovered 
compounds (Figure 5), but also aid in elucidating 
their mechanism of action through identification 
of gene expression networks. By extension, 
such signatures could provide easily identifiable 
biomarkers to assess the degree, time, 
and nature of exposure. 

DNA arrays are primarily a tool for examining 
differential gene expression in a given 
model. In this context they are referred to as 
closed systems because they lack the ability of 
other differential expression technologies, e.g., 
differential display and subtractive hybridization, 
to detect previously unknown genes not 
present on the array. This would appear to 
limit the power of DNA arrays to the imaginations 
and preconceptions of the researcher in 
selecting genes previously characterized and 
thought to be involved in the model system. 
However, the various genome sequencing projects 
have created a new category of 
sequence, the EST, that has partially mollified 
this deficiency. ESTs are cDNAs expressed 
in a given tissue that, although they may share 
some degree of sequence similarity to previously 
characterized genes, have not been assigned 
a specific genetic identity. By incorporating EST 
clones into an array, it is possible to monitor 
the expression of these unknown genes. This 
can enable the identification of previously 
uncharacterized genes that may have biologic 
significance in the model system. Filter arrays 
from Research Genetics and slide arrays from 
Incyte Pharmaceuticals both incorporate large 
numbers of ESTs from a variety of species. 

A further use of microarrays is the identification 
of single nucleotide polymorphisms 
(SNPs). These genomic variations are abundant 
(they occur approximately every 1 kb or 
so) and are the basis of restriction fragment 
length polymorphism analysis used in forensic 
analysis. Affymetrix, Inc., designed chips that 
contain multiple repeats of the same gene 
sequence. Each position is present with all four 
possible bases. After the hybridization of the 
sample, the degree of hybridization to the different 
sequences can be measured and the exact 
sequence of the target gene deduced. SNPs are 
thought to be of vital importance in drug 
metabolism and toxicology. For example, single 
base differences in the regulatory region or 
active site of some genes can account for huge 
differences in the activity of that gene. Such 
SNPs are thought to explain why some people 
are able to metabolize certain xenobiotics better 
than others. Thus, arrays provide a further 
tool for the toxicologist investigating the 
nature of susceptible subpopulations and 
toxicologic response. 
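
The sequence deduction described above can be sketched as follows. The example 
assumes that, for each interrogated position, one hybridization signal is 
available for each of the four possible bases and that the strongest signal 
identifies the base present in the sample. The data layout, function names, and 
intensity values are hypothetical and do not describe any vendor's software. 

```python
BASES = "ACGT"

def call_sequence(position_signals):
    """Pick, for each position, the base with the strongest signal."""
    return "".join(
        max(BASES, key=lambda base: signals[base])
        for signals in position_signals
    )

def find_variants(reference, called):
    """List (position, reference base, observed base) for each mismatch."""
    return [
        (index, ref_base, obs_base)
        for index, (ref_base, obs_base) in enumerate(zip(reference, called))
        if ref_base != obs_base
    ]

# Hypothetical signals for a four-base stretch of a target gene.
reference = "ACGT"
position_signals = [
    {"A": 900, "C": 50, "G": 40, "T": 30},
    {"A": 20, "C": 850, "G": 60, "T": 45},
    {"A": 700, "C": 30, "G": 100, "T": 20},  # sample differs from reference
    {"A": 25, "C": 40, "G": 35, "T": 950},
]

called = call_sequence(position_signals)
variants = find_variants(reference, called)
print(called, variants)
```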

There are still many wrinkles to be ironed 
out before arrays become a standard tool for 
toxicologists. The main issues raised at the 
workshop by those with hands-on experience 
were the following: 

• Expense: the cost of purchasing/contracting 
this technology is still too great for many 
individual laboratories. 
Figure 2. Potential effects of gene knockout within 
positively and negatively regulated gene expression 
networks. i y is limiting in wild type for expression of 
. (A) A simple, two-component, linear regulatory 
network operating on gene where /, is a positive 
effector of ij and /„ is either a positive or negative 
effector of v This network could be deduced by 
examining the consequence of deleting j n on the 
expression of *, and ^ where the expression of ^ 
would be decreased or increased depending on 
whether / fl was a positive or negative regulator. 
These and other connected components of even 
greater complexity could be revealed by 
genome-wide expression analysis. From Butow (15). 



• Clones: the logistics of identifying, obtaining,
and maintaining a set of nonredundant, non-
contaminated, sequence-verified, species/cell/
tissue/organ-specific clones.

• Use of inbred strains: where whole-organism
models are being used, the use of inbred
strains is important to reduce the potentially
confusing effects of the individual variation
typically seen in outbred populations.

• Probe: the need for relatively large amounts
of RNA, which limits the type of sample
(e.g., biopsy) that can be used. Also, different
RNA extraction methods can give different
results.

• Specificity: the ability to discriminate accu-
rately between closely related genes (e.g., the
cytochrome P-450 family) and splice variants.

• Quantitation: the quantitation of gene
expression using gene arrays is still open to
debate. One reason for this is the different
incorporation of the labeling dyes. However,
the main difficulty lies in knowing what to
normalize against. One option is to include a
large number of so-called housekeeping genes
in the array. However, the expression of these
genes often changes depending on the tissue
and the toxicant, so it is necessary to charac-
terize the expression of these genes in the
model system before utilizing them. This is
clearly not a viable option when screening
multiple new compounds. A second option
is to include on the array genes from a nonre-
lated species (e.g., a plant gene on an animal
array) and to spike the probe with synthetic
RNA(s) complementary to the gene(s).

• Reproducibility: this is sometimes question-
able, and a figure of approximately two or
three repeats was used as the minimum num-
ber required to confirm initial findings.
Again, however, most recommended the
use of Northern blots or reverse transcriptase
PCR to confirm findings.

• Sensitivity: concerns were voiced about the
number of target molecules that must be pre-
sent in a sample for them to be detected on
the array.

• Efficiency: reproducible identification of 1.5-
to 2-fold differences in expression was report-
ed, although the number of genes that
undergo this level of change and remain
undetected is open to debate. It is important
that this level of detection be ultimately
achieved because it is commonly perceived
that some important transcription factors
and their regulators respond at such low lev-
els. In most cases, 3- to 5-fold was the mini-
mum change that most were happy to
accept.

• Bioinformatics: perhaps the greatest concern
was how to interpret the data with
the greatest accuracy and efficiency. The
biggest headache is trying to identify net-
works of gene expression that are common to
different treatments or doses. The amount of
data from a single experiment is huge. It may
be that, in the future, several groups individ-
ually equipped with specialized software algo-
rithms for studying their favorite genes or
gene systems will be able to share the same
hybridized chips. Thus, arrays could usher in
a new perspective on collaboration and the
sharing of data.
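
The second normalization option raised under Quantitation above (spiking the probe with synthetic RNA for a gene from a nonrelated species) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the gene labels, and the intensity values are hypothetical, and the use of the median of the spike-in ratios as the scaling factor is one simple choice, not a procedure described at the workshop.

```python
from statistics import median

def normalize_to_spike_in(treated, control, spike_ids):
    """Return treated/control ratios after rescaling the treated channel so
    that the spike-in controls (assumed present in equal amounts in both
    samples) have a ratio of 1.

    treated, control: dicts mapping gene id -> signal intensity (hypothetical data)
    spike_ids: ids of the spiked-in control genes
    """
    # Ratio observed for each spike-in; ideally 1, in practice skewed by
    # labeling-dye incorporation and scanning differences.
    spike_ratios = [treated[g] / control[g] for g in spike_ids]
    scale = median(spike_ratios)
    # Apply the same correction to every non-control gene.
    return {g: (treated[g] / scale) / control[g]
            for g in treated if g not in spike_ids}

# Hypothetical intensities: the treated channel is uniformly twice as bright.
treated = {"plant_spike_1": 2000.0, "plant_spike_2": 4000.0,
           "geneA": 3000.0, "geneB": 500.0}
control = {"plant_spike_1": 1000.0, "plant_spike_2": 2000.0,
           "geneA": 1500.0, "geneB": 1000.0}
ratios = normalize_to_spike_in(treated, control,
                               {"plant_spike_1", "plant_spike_2"})
print(ratios)
```

In this toy data set geneA appears doubled before normalization but is unchanged once the spike-in correction is applied, whereas geneB is genuinely reduced.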

EPAMAC 

Perhaps the main reason most scientists are
unable to use array technology is the high cost
involved, whether buying off-the-shelf mem-
branes, using contract printing services, or
Figure 1. Gene expression profiles, also called fingerprints or signatures, of known toxicants or toxi-
cant families may, in the future, be used to identify the potential toxicity of new drugs, etc. In this exam-
ple, the genetic signature of test compound 1 is identical to that of known peroxisome proliferators,
whereas that of test compound 2 does not match any known toxicant family. Based on these results, test
compound 2 would be retained for further testing and test compound 1 would be eliminated.
producing chips in-house. In view of this,
researchers at the RTD/NHEERL initiated
the EPAMAC. This consortium brings
together scientists from the EPA and a num-
ber of extramural labs with the aim of devel-
oping microarray capability through the shar-
ing of resources and data. EPAMAC
researchers are primarily interested in the
developmental and toxicologic changes seen
in testicular and breast tissue, and a portion
of the workshop was set aside for EPAMAC
members to share their ideas on how the
experimental application of microarrays could
facilitate their research. One of the central
areas of interest to EPAMAC members is the
effect of xenobiotics on male fertility and
reproductive health. Of greatest concern is
the effect of exposure during critical periods
of development and germ cell differentiation
(5J), and how this may compromise sperm
counts and quality following sexual matura-
tion [JO). As well as spermatogenic tissue,
there is also interest in how residual mRNA
found in mature sperm 1 77) could be used as
an indicator of previous xenobiotic effects (it
is easier to obtain a semen sample than a tes-
ticular biopsy). Arrays will be used to examine
and compare the effect of exposure to heat
and chemicals in testicular and epididymal
gene expression profiles, with the aim of
establishing relationships/associations
between changes in developmental landmarks
and the effects on sperm count and quality.
Cluster, pattern, and other analysis of such
data should help identify hidden relationships
between genes that may reveal potential
mechanisms of action and uncover roles for
genes with unknown functions.

Summary 

The full impact of DNA arrays may not be
seen for several years, but the interest shown at
this regional workshop indicates the high level
of interest that they foster. Apart from educat-
ing and advertising the various technologies in
this field, this workshop brought together a
number of researchers from the Research
Triangle Park area who are already using DNA
arrays. The interest in sharing ideas and experi-
ences led to the initiation of a Triangle array
user's group.



Array technology is still in its infancy. This
means that the hardware is still improving and
there is no current consensus for standard pro-
cedures, quantitation, and interpretation.
Consistency in spotting and scanning arrays is
not yet optimized, and this is one of the most
critical requirements of any experiment. In
addition, one of the dark regions of array tech-
nology (strife in the courts over who owns
what portions of it) has further muddied the
waters and is a potential barrier toward the
development of consensus procedures.

Perhaps the greatest hurdle for the applica-
tion of arrays is the actual interpretation of
data. No specialists in bioinformatics attended
the workshop, largely because they are rare and
because as yet no one seems clear on the best
method of approaching data analysis and inter-
pretation. Cross-referencing results from mul-
tiple experiments (time, dose, repeats, different
animals, different species) to identify common-
ly expressed genes is a great challenge. In most
cases, we are still a long way from understand-
ing how the expression of gene X is related to
the expression of gene Y and ordering gene
expression to delineate causal relationships.

To the ordinary scientist in the typical lab- 
oratory, however, the most immediate prob- 
lem is a lack of affordable instrumentation. 
One can purchase premade membranes at 
relatively affordable prices. Although these 
may be useful in identifying individual genes 
to pursue in more detail using other methods, 
the numbers that would be required for even a 
small routine toxicology experiment prohibit 
this as a truly viable approach. For the toxicol-
ogist, there is a need to carry out multiple
experiments: dose responses, time curves,
multiple animals, and repeats. Glass-based
DNA arrays are most attractive in this context
because they can be prepared in large batches
from the same DNA source and accommo-
date control and treated samples on the same
chip. Another problem with current off-the-
shelf arrays is that they often do not contain
one or more of the particular genes a group is
interested in. One alternative is to obtain
and/or produce a set of custom clones and
have contract printing of membranes or slides
carried out by a company such as Genomic
Solutions, Inc. (Ann Arbor, MI). This approach
is less expensive than laying out capital for
one's own entire system, although at some
point it might make economic sense to print
one's own arrays.

Finally, DNA arrays are currently a team
effort. They are a technology that uses a wide
range of skills including engineering, statistics,
molecular biology, chemistry, and bioinfor-
matics. Because most individuals are skilled in
only one or perhaps two of these areas, it
appears that success with arrays may be best
expected by teams of collaborators consisting
of individuals having each of these skills.

Those considering array applications may
be amused or goaded on by the following
quote from Fortune magazine:

Microprocessors have reshaped our economy,
spawned vast fortunes and changed the way we live.
Gene chips could be even bigger.

Although this comment may have been
designed to excite the imagination rather than
accurately reflect the truth, it is fair to say that
the age of functional genomics is upon us.
DNA arrays look set to be an important tool in
this new age of biotechnology and will likely
contribute answers to some of toxicology's
most fundamental questions.
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Subject: RE: [Fwd: Toxicology Chip] 
Date: Mon. 3 Jul 2000 08:09:45 -0400 
From: "Afshari, Cynthia" <afshari@niehs.nih.gov>
To: "Diana Hamlet-Cox" <dianahc@incyte.com>

You can see the list of clones that we have on our 12K chip at
http : nanus 1 .r.iehs . mh . gov naps - guest ' clonesrch . cfr. 

We selected a subset of genes (2000+) that we believed critical to tox
response and basic cellular processes and added a set of clones and ESTs to
this. We have included a set of control genes (80+) that were selected by
the NHGRI because they did not change across a large set of array
experiments. However, we have found that some of these genes change
significantly after tox treatments and are in the process of looking at the
variation of each of these 80+ genes across our experiments.
Our chips are constantly changing and being updated and we hope that our
data will lead us to what the toxchip should really be.
I hope this answers your question.
Cindy Afshari 

>
> From: Diana Hamlet-Cox
> Sent: Monday, June 26, 2000 8:52 PM
> To: afshari@niehs.nih.gov
> Subject: [Fwd: Toxicology Chip]
>
> Dear Dr. Afshari,
>
> Since I have not yet had a response from Bill Grigg, perhaps he was not
> the right person to contact.
>
> Can you help me in this matter? I don't need to know the sequences,
> necessarily, but I would like very much to know what types of sequences
> are being used, e.g., GPCRs (more specific?), ion channels, etc.
>
> Diana Hamlet-Cox
>
> Original Message
> Subject: Toxicology Chip
> Date: Mon, 19 Jun 2000 18:31:48 -0700
> From: Diana Hamlet-Cox <dianahc@incyte.com>
> Organization: Incyte Pharmaceuticals
> To: grigg@niehs.nih.gov
>
> Dear Colleague:
>
> I am doing literature research on the use of expressed genes as
> pharmacotoxicology markers, and found the Press Release dated February
> 29, 2000 regarding the work of the NIEHS in this area. I would like to
> know if there is a resource I can access (or you could provide?) that
> would give me a list of the 12,000 genes that are on your Human ToxChip
> Microarray. In particular, I am interested in the criteria used to
> select sequences for the ToxChip, including any control sequences
> included in the microarray.
>
> Thank you for your assistance in this request.
>
> Diana Hamlet-Cox, Ph.D.
> Incyte Genomics Inc.
>
> --
>
> =====================
> This email message is for the sole use of the intended recipient(s) and
> may contain confidential and privileged information subject to
> attorney-client privilege. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended recipient,
> please contact the sender by reply email and destroy all copies of the
> original message.
>
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ABSTRACT Pair-wise sequence comparison methods have 
been assessed using proteins whose relationships are known 
reliably from their structures and functions, as described in 
the scop database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evalua-
tion tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores 
to evaluate matches rather than percentage identity or raw 
scores. The E-value statistical scores of SSEARCH and FASTA are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships
between proteins whose sequence identities are >30%. For 
more distantly related proteins, they do much less well; only 
one-half of the relationships between proteins with 20-30% 
identity are found. Because many homologs have low sequence 
similarity, most distant relationships cannot be detected by 
any pairwise comparison method; however, those which are 
identified may be used with confidence. 

Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known ho- 
mologs have been identified by sequence analysis (the method 
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 
Sequence comparison methodologies have evolved rapidly,
so no previously published test has evaluated modern versions
of programs commonly used. For example, parameters in 
blast (1) have changed, and WU-BLAST2 (2)— which produces 
gapped alignments— has become available. The latest version 
of fasta (3) previously tested was 1.6, but the current release 
(version 3.0) provides fundamentally different results in the 
form of statistical scoring. 

The previous reports also have left gaps in our knowledge. 
For example, there has been no published assessment of 
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question: 
in an absolute sense, how well does pairwise sequence com- 
parison work? That is, what fraction of homologous proteins 
can be detected using modern database searching methods? 

In this work, we attempt to answer these questions and to 
overcome both of the fundamental difficulties that have hin- 
dered assessment of sequence comparison methodologies. 
First, we use the set of distant evolutionary relationships in the 
scop: Structural Classification of Proteins database (4), which 
is derived from structural and functional characteristics (5). 
The scop database provides a uniquely reliable set of ho- 
mologs, which are known independently of sequence compar- 
ison. Second, we use an assessment method that jointly mea- 
sures both sensitivity and specificity. This method allows 
straightforward comparison of different sequence searching 
procedures. Further, it can be used to aid interpretation of real 
database searches and thus provide optimal and reliable 
results. 

Previous Assessments of Sequence Comparison: Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most encom- 
passing analyses have been by Pearson (6, 7), who compared 
the three most commonly used programs. Of these, the Smith- 
Waterman algorithm (8) implemented in ssearch (3) is the 
oldest and slowest but the most rigorous. Modern heuristics 
have provided blast (1) the speed and convenience to make 
it the most popular program. Intermediate between these two 
is fasta (3), which may be run in two modes offering either 
greater speed (ktup = 2) or greater effectiveness (ktup = 1). 
Pearson also considered different parameters for each of these 
programs. 

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
PIR database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of pir 

Abbreviation: EPQ, errors per query. 
superfamilies. Pearson found that modern matrices and "ln-
scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked 
slightly better than fasta, which was in turn more effective 
than blast. 

Very large scale analyses of matrices have been performed 
(10), and Henikoff and Henikoff (11) also evaluated the 
effectiveness of blast and fasta. Their test with blast 
considered the ability to detect homologs above a predeter- 
mined score but had no penalty for methods which also 
reported large numbers of spurious matches. The Henikoffs 
searched the swiss-prot database (12) and used prosite (13) 
to define homologous families. Their results showed that the 
BLOSUM62 matrix (14) performed markedly better than the 
extrapolated PAM-series matrices (15), which previously had 
been popular. 

A crucial aspect of any assessment is the data that are used 
to test the ability of the program to find homologs. But in 
Pearson's and the Henikoffs' evaluations of sequence com-
parison, the correct results were effectively unknown. This is
because the superfamilies in pir and prosite are principally 
created by using the same sequence comparison methods 
which are being evaluated. Interdependency of data and 
methods creates a "chicken and egg" problem, and means for 
example, that new methods would be penalized for correctly 
identifying homologs missed by older programs. For instance, 
immunoglobulin variable and constant domains are clearly 
homologous, but pir places them in different superfamilies. 
The problem is widespread: each superfamily in pir 48.00 with
a structural homolog is itself homologous to an average of 1.6
other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and Schnei- 
der (17) used protein structures to evaluate sequence com- 
parison. Rather than comparing different sequence compari- 
son algorithms, their work focused on determining a length- 
dependent threshold of percentage identity, above which all 
proteins would be of similar structure. A result of this analysis 
was the hssp equation; it states that proteins with 25% identity 
over 80 residues will have similar structures, whereas shorter 
alignments require higher identity. (Other studies also have 
used structures (18-20), but these focused on a small number 
of model proteins and were principally oriented toward eval- 
uating alignment accuracy rather than homology detection.) 

A general solution to the problem of scoring comes from 
statistical measures (i.e., E-values and P-values) based on the 
extreme value distribution (21). Extreme value scoring was 
implemented analytically in the blast program using the 
Karlin and Altschul statistics (22, 23) and empirical ap- 
proaches have been recently added to fasta and ssearch. In 
addition to being heralded as a reliable means of recognizing 
significantly similar proteins (24, 25), the mathematical trac-
tability of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in 
ref. 24). However, all large empirical tests used random 
sequences that may lack the subtle structure found within 
biological sequences (26, 27) and obviously do not contain any 
real homologs. Thus, although many researchers have sug- 
gested that statistical scores be used to rank matches (24, 25, 
28), there have been no large rigorous experiments on biolog- 
ical data to determine the degree to which such rankings are 
superior. 
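
The relationship between the two statistical measures mentioned above can be made concrete with a short sketch. It assumes the standard Karlin and Altschul form for the expected number of chance matches, E = K·m·n·exp(-λS), and the usual conversion P = 1 - exp(-E). The values of `lam` and `k` below are illustrative placeholders rather than parameters taken from this paper, and the function names are hypothetical.

```python
import math

def karlin_altschul_evalue(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score
    when a query of query_len residues searches db_len residues."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def pvalue_from_evalue(evalue):
    """Probability of at least one chance match, given the expected count."""
    return -math.expm1(-evalue)  # numerically stable form of 1 - exp(-E)

# Illustrative placeholder parameters (assumptions, not from the paper).
lam, k = 0.318, 0.134
for score in (40, 60, 80):
    e = karlin_altschul_evalue(score, query_len=250, db_len=500_000, lam=lam, k=k)
    print(score, e, pvalue_from_evalue(e))
```

For small values the E-value and P-value are nearly identical, while for large E-values the P-value saturates toward 1, which is why the two measures can only be compared directly at the high-significance end of the scale.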

A Database for Testing Homology Detection. Since the 
discovery that the structures of hemoglobin and myoglobin are 
very similar though their sequences are not (29), it has been 
apparent that comparing structures is a more powerful (if less 
convenient) way to recognize distant evolutionary relation- 
ships than comparing sequences. If two proteins show a high 
degree of similarity in their structural details and function, it 
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is very probable that they have an evolutionary relationship 
though their sequence similarity may be low. 

The recent growth of protein structure information com- 
bined with the comprehensive evolutionary classification in 
the scop database (4, 5) have allowed us to overcome previous 
limitations. With these data, we can evaluate the performance 
of sequence comparison methods on real protein sequences 
whose relationships are known confidently. The scop database
uses structural information to recognize distant homologs, the 
large majority of which can be determined unambiguously. 
These superfamilies, such as the globins or the immunoglobu- 
lins, would be recognized as related by the vast majority of the 
biological community despite the lack of high sequence sim- 
ilarity. 

From scop, we extracted the sequences of domains of 
proteins in the Protein Data Bank (PDB) (30) and created two 
databases. One (PDB90D-B) has domains, which were all <90% 
identical to any other, whereas (PDB40D-B) had those <40% 
identical. The databases were created by first sorting all 
protein domains in scop by their quality and making a list. The 
highest quality domain was selected for inclusion in the 
database and removed from the list. Also removed from the list 
(and discarded) were all other domains above the threshold 
level of identity to the selected domain. This process was 
repeated until the list was empty. The PDB40D-B database 
contains 1,323 domains, which have 9,044 ordered pairs of 
distant relationships, or ≈0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988 relation- 
ships, representing 1.2% of all pairs. Low complexity regions 
of sequence can achieve spurious high scores, so these were 
masked in both databases by processing with the seg program 
(27) using recommended parameters: 12 1.8 2.0. The databases 
used in this paper are available from http://sss.stanford.edu/
sss/, and databases derived from the current version of scop 
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/. 
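
The database construction described above is a greedy selection, and a minimal sketch may make the procedure easier to follow. The function and variable names, the toy domains, and their quality and identity values are hypothetical; only the selection logic and the pair counts come from the text.

```python
def select_nonredundant(domains, quality, identity, threshold):
    """Greedy selection as described in the text: repeatedly take the
    highest-quality remaining domain, then discard every other domain whose
    identity to it is at or above the threshold."""
    remaining = sorted(domains, key=quality, reverse=True)
    selected = []
    while remaining:
        best = remaining.pop(0)
        selected.append(best)
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return selected

def ordered_pairs(n):
    """Number of ordered (query, target) pairs among n distinct sequences."""
    return n * (n - 1)

# Hypothetical toy data: three domains with made-up quality and identity values.
toy_quality = {"a": 3, "b": 2, "c": 1}
toy_identity = {frozenset("ab"): 95.0, frozenset("ac"): 60.0, frozenset("bc"): 20.0}
ident = lambda x, y: toy_identity[frozenset((x, y))]

print(select_nonredundant("abc", toy_quality.get, ident, 90.0))  # 90% threshold
print(select_nonredundant("abc", toy_quality.get, ident, 40.0))  # 40% threshold

# Pair counts for the database sizes reported in the text.
print(ordered_pairs(1323), 100 * 9044 / ordered_pairs(1323))
print(ordered_pairs(2079), 100 * 53988 / ordered_pairs(2079))
```

The pair counts reproduce the figures quoted above: 1,323 domains give 1,749,006 ordered pairs, of which the 9,044 homologous pairs are about 0.5%.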

Analyses from both databases were generally consistent, but 
PDB40D-B focuses on distantly related proteins and reduces the 
heavy overrepresentation in the pdb of a small number of 
families (31, 32), whereas PDB90D-B (with more sequences) 
improves evaluations of statistics. Except where noted other- 
wise, the distant homolog results here are from PDB40D-B. 
Although the precise numbers reported here are specific to the 
structural domain databases used, we expect the trends to be 
general. 

Assessment Data and Procedure. Our assessment of se- 
quence comparison may be divided into four different major 
categories of tests. First, using just a single sequence compar- 
ison algorithm at a time, we evaluated the effectiveness of 
different scoring schemes. Second, we assessed the reliability 
of scoring procedures, including an evaluation of the validity 
of statistical scoring. Third, we compared sequence compari- 
son algorithms (using the optimal scoring scheme) to deter- 
mine their relative performance. Fourth, we examined the 
distribution of homologs and considered the power of pairwise 
sequence comparison to recognize them. All of the analyses 
used the databases of structurally identified homologs and a 
new assessment criterion. 

The analyses tested BLAST (1), version 1.4.9MP, and WU-
BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided fasta and the
ssearch implementation of Smith-Waterman (8). For
ssearch and fasta, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix (BLO-
SUM62) were used for blast and WU-BLAST2.

The "Coverage Vs. Error" Plot. To test a particular protocol
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have 
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perfect separation, with all of the homologs at the top of the 
list and unrelated proteins below. In practice, perfect separa- 
tion is impossible to achieve so instead one is interested in 
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 

Our procedure involved measuring the coverage and error
for every threshold. Coverage was defined as the fraction of
structurally determined homologs that have scores above the
selected threshold; this reflects the sensitivity of a method.
Errors per query (EPQ), an indicator of selectivity, is the
number of nonhomologous pairs above the threshold divided
by the number of queries. Graphs of these data, called
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of Re-
ceiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of nonho-
mologs.
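
The coverage and EPQ definitions above translate directly into a short computation. The sketch below is an illustration only: the function names and the toy hit list are hypothetical, higher scores are assumed to be better, and tied scores are not treated specially.

```python
def coverage_vs_error(hits, n_homolog_pairs, n_queries):
    """Compute one (score, coverage, EPQ) point per candidate threshold.

    hits: (score, is_homolog) for every reported query-target pair
    n_homolog_pairs: number of structurally determined homologous pairs
    n_queries: number of sequences used as queries
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)  # best to worst
    points, true_pos, false_pos = [], 0, 0
    for score, is_homolog in ranked:
        if is_homolog:
            true_pos += 1
        else:
            false_pos += 1
        # Threshold placed at this score: everything ranked so far is accepted.
        points.append((score, true_pos / n_homolog_pairs, false_pos / n_queries))
    return points

def coverage_at_epq(points, max_epq):
    """Best coverage achievable without exceeding the given error rate."""
    return max((cov for _, cov, epq in points if epq <= max_epq), default=0.0)

# Hypothetical toy search: 2 queries, 4 true homologous pairs (one never reported).
toy_hits = [(90, True), (80, True), (70, False), (60, True), (50, False)]
points = coverage_vs_error(toy_hits, n_homolog_pairs=4, n_queries=2)
for point in points:
    print(point)
print(coverage_at_epq(points, 0.0), coverage_at_epq(points, 0.5))
```

Because EPQ is computed over the pooled, globally sorted list, a protocol is rewarded only if its scores are comparable from one query to the next.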

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consis- 
tency; that is, it requires scores to be comparable for different 
queries. Consistency is an aspect which has been largely 
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Fig. 3. Length and percentage identity of alignments of unrelated 
proteins in PDB90D-B: Each pair of nonhomologous proteins found with 
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the hssp threshold (though it is intended to be applied 
with a different matrix and parameters). 
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Reliability of Statistical Scores (PDB90D-B) 
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Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows 
the relationship between reported statistical score and actual error 
rate for a different program. E-values are reported for ssearch and 
fasta, whereas P-values are shown for blast and wu-blast2. If the 
scoring were perfect, then the number of errors per query and the 
E-values would be the same, as indicated by the upper bold line. 
(P-values should be the same as EPQ for small numbers, and diverge
at higher values, as indicated by the lower bold line.) E-values from
ssearch and fasta are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B 
despite the difference in number of homologs detected. This graph 
could be used to roughly calibrate the reliability of a given statistical 
score. 

ignored in previous tests but is essential for the straightforward 
or automatic interpretation of sequence comparison results. 
Further, it provides a clear indication of the confidence that 
should be ascribed to each match. Indeed, the EPQ measure 
should approximate the expectation value reported by data- 
base searching programs, if the programs' estimates are accu- 
rate. 

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution matrix scores for each position in the align- 
ment and subtracting gap penalties. In blast, a measure 
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results 
are summarized in Fig. 1. 
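
The first two score types can be illustrated on a single gapped alignment. The sketch below is not the scoring code of any of the programs tested: the substitution function is a hypothetical toy (+5 for a match, -1 for a mismatch) rather than a BLOSUM matrix, and the gap convention (12 for the first gapped position, 1 for each further position in the same gap) is an assumption made for the example.

```python
def score_alignment(row_a, row_b, subst, gap_first=12, gap_next=1):
    """Return (raw_score, identical_positions) for two aligned rows of equal
    length, where '-' marks a gap. Substitution scores are summed and gap
    penalties subtracted."""
    assert len(row_a) == len(row_b)
    score, identical, prev_gap_row = 0, 0, None
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening one.
            score -= gap_next if gap_row == prev_gap_row else gap_first
            prev_gap_row = gap_row
        else:
            score += subst(x, y)
            identical += (x == y)
            prev_gap_row = None
    return score, identical

def toy_subst(x, y):
    """Hypothetical substitution scores, for illustration only."""
    return 5 if x == y else -1

row_a = "ACDEFG--HI"
row_b = "ACDQFGKLHI"
raw, identical = score_alignment(row_a, row_b, toy_subst)

# Percentage identity depends on the chosen denominator.
pct_over_alignment = 100 * identical / len(row_a)
shorter_len = min(len(row_a.replace("-", "")), len(row_b.replace("-", "")))
pct_over_shorter_seq = 100 * identical / shorter_len
print(raw, identical, pct_over_alignment, pct_over_shorter_seq)
```

The same alignment yields 70% identity over the alignment length but 87.5% over the shorter sequence, which shows why a bare percentage is ambiguous unless the denominator is stated.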

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common 
rule-of-thumb stating that 30% identity signifies homology. 
Moreover, publications have indicated that 25% identity can 
be used as a threshold (17, 36). We find that these thresholds, 
originally derived years ago, are not supported by present 
results. As databases have grown, so have the possibilities for 
chance alignments with high identity; thus, the reported cutoffs 
lead to frequent errors. Fig. 2 shows one of the many pairs of 
proteins with very different structures that nonetheless have 
high levels of identity over considerable aligned regions. 
Despite the high identity, the raw and the statistical scores for 
such incorrect matches are typically not significant. The prin- 
cipal reasons percentage identity does so poorly seem to be 
that it ignores information about gaps and about the conser- 
vative or radical nature of residue substitutions. 

From the PDB90D-B analysis in Fig. 3, we learn that 30% 
identity is a reliable threshold for this database only for 
sequence alignments of at least 150 residues. Because one 
unrelated pair of proteins has 43.5% identity over 62 residues, 
it is probably necessary for alignments to be at least 70 residues 
in length before 40% is a reasonable threshold, for a database 
of this particular size and composition. 

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by 
statistical scoring. If one measures the percentage identity in 
the aligned regions without consideration of alignment length, 
then a negligible number of distant homologs are detected. 
Use of the Hssp equation improves the value of percentage 
identity, but even this measure can find only 4% of all known 
homologs at 1% EPQ. In short, percentage identity discards 
most of the information measured in a sequence comparison. 

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no 
notable benefit in our analysis. It is necessary to be very precise 
when using either raw or bit scores because a 20% change in 
cutoff score could yield a tenfold difference in EPQ. However, 
it is difficult to choose appropriate thresholds because the 
reliability of a bit score depends on the lengths of the proteins 
matched and the size of the database. Raw score thresholds 
also are affected by matrix and gap parameters. 

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most
likely, its power can be attributed to its incorporation of more
information than any other measure; it takes account of the
full substitution and gap data (like raw scores) but also has
details about the sequence lengths and composition and is
scaled appropriately.

Fig. 5. Coverage vs. error plots of different sequence comparison
methods. Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In
this analysis, the best method is the slow ssearch, which finds 18% of
relationships at 1% EPQ. fasta ktup = 1 and wu-blast2 are almost as
good. (B) PDB90D-B database. The quick wu-blast2 program provides the
best coverage at 1% EPQ on this database, although at higher levels of
error it becomes slightly worse than fasta ktup = 1 and ssearch.

We find that statistical scores are not only powerful but also
easy to interpret. ssearch and fasta show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two
sequences being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted 
in this way, and these results validate the suitability of the 
extreme value distribution for describing the scores from a 
database search. 
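
The interpretation of an E-value as an expected count can be illustrated with a short sketch. The function names are hypothetical, and the conversion from an E-value to a P-value uses the usual Poisson relation P = 1 - exp(-E), which is an assumption here and is not spelled out in the text above.

```python
import math


def expected_false_matches(e_value, n_queries):
    """Expected number of chance matches at or below `e_value` over n_queries.

    Assumes the E-value is well calibrated, i.e. it equals the expected
    number of chance matches per query.
    """
    return e_value * n_queries


def p_value_from_e_value(e_value):
    """Probability of at least one chance match, assuming Poisson counts."""
    return -math.expm1(-e_value)


print(expected_false_matches(0.01, 100))
print(p_value_from_e_value(0.01))
```

For small values the two quantities are nearly equal, so an E-value of 0.01 and a P-value of 0.01 carry almost the same meaning when both are well calibrated.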

The P-values from blast also should be directly interpret-
able but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database. Nonetheless,
these results strongly suggest that the analytic theory is
fundamentally appropriate. wu-blast2 scores were more
reliable than those from blast, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.

Overall Detection of Homologs and Comparison of Algorithms.
The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even ssearch with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. blast, which
identifies 15%, was the worst performer, whereas fasta
ktup = 1 is nearly as effective as ssearch. fasta ktup = 2 and
wu-blast2 are intermediate in their ability to detect homologs.
Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. ssearch is 25 times slower than blast and 6.5 times
slower than fasta ktup = 1. wu-blast2 is slightly faster than
fasta ktup = 2, but the latter has more interpretable scores.
In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is wu-blast2. Consequently, we infer that the
differences between fasta ktup = 1, ssearch, and wu-blast2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.
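
The coverage figures quoted at a fixed EPQ can be thought of as the result of a simple walk down a ranked list of matches. The sketch below is one way such a calculation might be done; the function name, the input layout, and the toy data are assumptions for illustration and are not the evaluation code used in this work. Ties between scores are ignored for simplicity.

```python
def coverage_at_epq(hits, n_queries, n_true_pairs, max_epq=0.01):
    """Return (cutoff, coverage) at the most permissive acceptable cutoff.

    hits         -- iterable of (e_value, is_homolog) for all reported pairs
    n_queries    -- number of database queries performed
    n_true_pairs -- number of homologous pairs known from structure
    max_epq      -- tolerated errors per query (0.01 corresponds to 1% EPQ)
    """
    allowed_errors = max_epq * n_queries
    errors = 0
    found = 0
    cutoff = None
    # Most significant matches first: a lower E-value is a better match.
    for e_value, is_homolog in sorted(hits, key=lambda h: h[0]):
        if is_homolog:
            found += 1
        else:
            if errors + 1 > allowed_errors:
                break  # accepting this false match would exceed the EPQ limit
            errors += 1
        cutoff = e_value
    return cutoff, found / n_true_pairs


# Toy data: 100 queries, 10 true homologous pairs, 7 reported matches.
toy_hits = [
    (1e-30, True), (1e-10, True), (1e-4, True), (0.02, False),
    (0.05, True), (0.5, False), (1.0, True),
]
cutoff, coverage = coverage_at_epq(toy_hits, n_queries=100, n_true_pairs=10)
print(cutoff, coverage)
```

Because coverage is measured against all structurally known homologous pairs, pairs that a program never reports at any score still count against it.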

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such relationships
have no more sequence identity than would be expected
by chance. ssearch with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have significant
E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by ssearch E-values. However, although the number
of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity
are detected and only 10% of those with 15-20% can be found.
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power
of the method is restricted by the great divergence of many
protein sequences.

Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(ssearch with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged
extremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that ssearch can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

After completion of this work, a new version of pairwise
BLAST was released: blastgp (37). It supports gapped alignments,
like wu-blast2, and dispenses with sum statistics. Our
initial tests on blastgp using default parameters show that its
E-values are reliable and that its overall detection of homologs
was substantially better than that of ungapped blast, but not
quite equal to that of wu-blast2.

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, and 27
and references therein) suggests that the most effective
sequence searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view.

Our results also suggest two further points. First, the E-values
reported by fasta and ssearch give fairly accurate
estimates of the significance of each match, but the P-values
provided by blast and wu-blast2 underestimate the true
extent of errors. Second, ssearch, wu-blast2, and fasta
ktup = 1 perform best, though blast and fasta ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                 Relative time*  1% EPQ cutoff     Coverage at 1% EPQ (%)
ssearch % identity: within alignment   25.5            >70%              <0.1
ssearch % identity: within both        25.5            34%               3.0
ssearch % identity: Hssp-scaled        25.5            35% (HSSP + 9.8)  4.0
ssearch Smith-Waterman raw scores      25.5            142               10.5
ssearch E-values                       25.5            0.03              18.4
fasta ktup = 1 E-values                3.9             0.03              17.9
fasta ktup = 2 E-values                1.4             0.03              16.7
wu-blast2 P-values                     1.1             0.003             17.5
blast P-values                         1.0             0.00016           14.8

*Times are from large database searches with genome proteins.

The homologous proteins that are found by sequence com- 
parison can be distinguished with high reliability from the huge 
number of unrelated pairs. However, even the best database 
searching procedures tested fail to find the large majority of 
distant evolutionary relationships at an acceptable error rate. 
Thus, if the procedures assessed here fail to find a reliable 
match, it does not imply that the sequence is unique; rather, it 
indicates that any relatives it might have are distant ones.** 



Additional and updated information about this work, including 
supplementary figures, may be found at http://sss.stanford.edu/sss/. 